# It's a good idea to ensure you're running the latest version of any libraries you need.
# `!pip install -Uqq <libraries>` upgrades to the latest version of <libraries>
# NB: You can safely ignore any warnings or errors pip spits out about running as root or incompatibilities
!pip install -Uqq fastai fastbook duckduckgo_search timm
Practical Deep Learning for Coders - Part 1 Notes and Examples
Vishal Bakshi
This notebook contains my notes (of course videos, example notebooks and book chapters) and exercises of Part 1 of the course Practical Deep Learning for Coders.
Lesson 1: Getting Started
Notebook Exercise
The first thing I did was run through the lesson 1 notebook from start to finish. In that notebook, they download training and validation images of birds and forests, then train an image classifier that identifies images of birds with 100% accuracy.
The first exercise is for us to create our own image classifier with our own image searches. I’ll create a classifier that accurately predicts whether an image contains an alligator.
I’ll start by using their example code for getting images using DuckDuckGo image search:
from duckduckgo_search import ddg_images
from fastcore.all import *
def search_images(term, max_images=30):
    print(f"Searching for '{term}'")
    return L(ddg_images(term, max_results=max_images)).itemgot('image')
The search_images function takes a search term and a max_images value for the maximum number of images. It prints a line of text saying it’s "Searching for" the term and returns an L object of the image URLs.
The ddg_images function returns a list of JSON objects containing the title, image URL, thumbnail URL, height, width and source of the image.
search_object = ddg_images('alligator', max_results=1)
search_object
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:60: UserWarning: ddg_images is deprecated. Use DDGS().images() generator
warnings.warn("ddg_images is deprecated. Use DDGS().images() generator")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:64: UserWarning: parameter page is deprecated
warnings.warn("parameter page is deprecated")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:66: UserWarning: parameter max_results is deprecated
warnings.warn("parameter max_results is deprecated")
[{'title': 'The Creature Feature: 10 Fun Facts About the American Alligator | WIRED',
'image': 'https://www.wired.com/wp-content/uploads/2015/03/Gator-2.jpg',
'thumbnail': 'https://tse4.mm.bing.net/th?id=OIP.FS96VErnOXAGSWU092I_DQHaE8&pid=Api',
'url': 'https://www.wired.com/2015/03/creature-feature-10-fun-facts-american-alligator/',
'height': 3456,
'width': 5184,
'source': 'Bing'}]
Wrapping this list in an L object and calling .itemgot('image') on it extracts the URL value associated with the image key in each JSON object.
L(search_object).itemgot('image')
(#1) ['https://www.wired.com/wp-content/uploads/2015/03/Gator-2.jpg']
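Under the hood, itemgot just pulls the same key from every element in the list. A rough pure-Python equivalent (an illustration only, not fastcore's actual implementation, using a made-up search result) looks like this:

```python
# Rough pure-Python equivalent of L(results).itemgot('image').
# Illustration only — not fastcore's actual implementation.
def itemgot(items, key):
    """Extract `key` from every dict in `items`."""
    return [item[key] for item in items]

# A fake search result shaped like ddg_images output:
search_object = [{'title': 'Gator photo',
                  'image': 'https://example.com/gator.jpg',
                  'width': 5184}]

urls = itemgot(search_object, 'image')
print(urls)  # ['https://example.com/gator.jpg']
```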
Next, they provide some code to download the image to a destination filename and view the image:
urls = search_images('alligator', max_images=1)
from fastdownload import download_url
dest = 'alligator.jpg'
download_url(urls[0], dest, show_progress=False)
from fastai.vision.all import *
im = Image.open(dest)
im.to_thumb(256,256)
Searching for 'alligator'

For my not-alligator images, I’ll use images of a swamp.
download_url(search_images('swamp photos', max_images=1)[0], 'swamp.jpg', show_progress=False)
Image.open('swamp.jpg').to_thumb(256,256)
Searching for 'swamp photos'

In the following code, I’ll search for both terms, alligator and swamp, and store the images in the alligator_or_not/alligator and alligator_or_not/swamp paths, respectively.
The parents=True argument creates any intermediate parent directories that don’t exist (in this case, the alligator_or_not directory). The exist_ok=True argument suppresses the FileExistsError if the directory already exists.
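The mkdir behavior described above comes straight from pathlib and can be demonstrated without fastai; the directory names here are just examples inside a temporary folder:

```python
# Demonstrating mkdir(parents=True, exist_ok=True) with plain pathlib.
# The directory names are illustrative; everything lives in a temp folder.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
dest = root/'alligator_or_not'/'alligator'

dest.mkdir(parents=True, exist_ok=True)   # creates intermediate dirs too
dest.mkdir(parents=True, exist_ok=True)   # second call: no FileExistsError
print(dest.is_dir())  # True
```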
searches = 'swamp','alligator'
path = Path('alligator_or_not')
from time import sleep
for o in searches:
    dest = (path/o)
    dest.mkdir(exist_ok=True, parents=True)
    download_images(dest, urls=search_images(f'{o} photo'))
    sleep(10)  # Pause between searches to avoid over-loading server
    download_images(dest, urls=search_images(f'{o} sun photo'))
    sleep(10)
    download_images(dest, urls=search_images(f'{o} shade photo'))
    sleep(10)
    resize_images(path/o, max_size=400, dest=path/o)
Searching for 'swamp photo'
Searching for 'swamp sun photo'
Searching for 'swamp shade photo'
Searching for 'alligator photo'
Searching for 'alligator sun photo'
Searching for 'alligator shade photo'
Next, I’ll train my model using the code they have provided.
The get_image_files function is a fastai function which takes a Path object and returns an L object with paths to the image files.
type(get_image_files(path))
fastcore.foundation.L
get_image_files(path)
(#349) [Path('alligator_or_not/swamp/1b3c3a61-0f7f-4dc2-a704-38202d593207.jpg'),Path('alligator_or_not/swamp/9c9141f2-024c-4e26-b343-c1ca1672fde8.jpeg'),Path('alligator_or_not/swamp/1340dd85-5d98-428e-a861-d522c786c3d7.jpg'),Path('alligator_or_not/swamp/2d3f91dc-cc5f-499b-bec6-7fa0e938fb13.jpg'),Path('alligator_or_not/swamp/84afd585-ce46-4016-9a09-bd861a5615db.jpg'),Path('alligator_or_not/swamp/6222f0b6-1f5f-43ec-b561-8e5763a91c61.jpg'),Path('alligator_or_not/swamp/a71c8dcb-7bbb-4dba-8ae6-8a780d5c27c6.jpg'),Path('alligator_or_not/swamp/bbd1a832-a901-4e8f-8724-feac35fa8dcb.jpg'),Path('alligator_or_not/swamp/45b358b3-1a12-41d4-8972-8fa98b2baa52.jpg'),Path('alligator_or_not/swamp/cf664509-8eb6-42c8-9177-c17f48bc026b.jpg')...]
The fastai parent_label function takes a Path object and returns a string of the file’s parent folder name.
parent_label(Path('alligator_or_not/swamp/18b55d4f-3d3b-4013-822b-724489a23f01.jpg'))
'swamp'
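The same result can be had with plain pathlib, which shows there is nothing magic going on; this sketch is an illustration, not fastai's source:

```python
# parent_label just reads the parent folder's name. A plain-pathlib
# equivalent (illustration only, not fastai's implementation):
from pathlib import Path

def parent_label_sketch(p):
    """Return the name of the file's parent folder."""
    return Path(p).parent.name

print(parent_label_sketch('alligator_or_not/swamp/18b55d4f.jpg'))  # swamp
```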
Some image files that are downloaded may be corrupted, so they have provided a verify_images function to find images that can’t be opened. Those images are then removed (unlinked) from the path.
failed = verify_images(get_image_files(path))
failed.map(Path.unlink)
len(failed)
1
failed
(#1) [Path('alligator_or_not/alligator/1eb55508-274b-4e23-a6ae-dbbf1943a9d1.jpg')]
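failed.map(Path.unlink) simply deletes each broken file. A sketch of that effect with plain pathlib and a simulated "corrupt" file in a temp folder (the filename is made up):

```python
# What failed.map(Path.unlink) does: delete each file in the list.
# Sketched with tempfile; the "corrupt" file here is simulated.
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
bad = tmp/'corrupt.jpg'
bad.write_bytes(b'not a real image')

failed = [bad]          # stand-in for verify_images' result
for p in failed:
    p.unlink()          # same effect as failed.map(Path.unlink)

print(bad.exists())  # False
```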
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=[Resize(192, method='squish')]
).dataloaders(path, bs=32)
dls.show_batch(max_n=6)
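Conceptually, RandomSplitter(valid_pct=0.2, seed=42) shuffles the item indices with a fixed seed and holds out 20% for validation. A pure-Python sketch of that idea (not fastai's actual implementation):

```python
# Conceptual sketch of RandomSplitter(valid_pct=0.2, seed=42):
# shuffle indices with a fixed seed, hold out the first 20% for validation.
# Illustration only — not fastai's implementation.
import random

def random_split(n, valid_pct=0.2, seed=42):
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)   # fixed seed -> same split every run
    cut = int(n * valid_pct)
    return idxs[cut:], idxs[:cut]       # (train, valid)

train, valid = random_split(10)
print(len(train), len(valid))  # 8 2
```

Because the seed is fixed, every run produces the same validation set, which is what makes model comparisons meaningful.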
I’ll train the model using their code, which uses the resnet18 image classification model and fine-tunes it for 3 epochs.
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.690250 | 0.171598 | 0.043478 | 00:03 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.127188 | 0.001747 | 0.000000 | 00:02 |
| 1 | 0.067970 | 0.006409 | 0.000000 | 00:02 |
| 2 | 0.056453 | 0.004981 | 0.000000 | 00:02 |
The accuracy is 100%.
Next, I’ll test the model as they’ve done in the lesson.
PILImage.create('alligator.jpg').to_thumb(256,256)
is_alligator,_,probs = learn.predict(PILImage.create('alligator.jpg'))
print(f"This is an: {is_alligator}.")
print(f"Probability it's an alligator: {probs[0]:.4f}")
This is an: alligator.
Probability it's an alligator: 1.0000
Video Notes
In this section, I’ll take notes while I watch the lesson 1 video.
- This is the fifth version of the course!
- What seemed impossible in 2015 (image recognition of a bird) is now free and something we can build in 2 minutes.
- All models need numbers as their inputs. Images are already stored as numbers in computers. PixSpy lets you (among other things) view the color of each pixel in an image file.
- A DataBlock gives fastai all the information it needs to create a computer vision model.
- Creating really interesting, real, working programs with deep learning doesn’t take a lot of code, math, or more than a laptop computer. It’s pretty accessible.
- Deep Learning models are doing things that very few, if any of us, believed would be possible to do by computers in our lifetime.
- See the Practical Data Ethics course as well.
- Meta Learning: How To Learn Deep Learning And Thrive In The Digital World.
- Books on learning/education:
- Mathematician’s Lament by Paul Lockhart
- Making Learning Whole by David Perkins
- Why are we able to create a bird-recognizer in a minute or two? And why couldn’t we do it before?
- 2012: Project looking at 5-year survival of breast cancer patients, pre-deep learning approach
- Assembled a team to build ideas for thousands of features that required a lot of expertise, took years.
- They fed these features into a logistic regression model to predict survival.
- Neural networks don’t require us to build these features, they build them for us.
- 2013: Matthew D. Zeiler and Rob Fergus looked inside a neural network to see what it had learned.
- We don’t give it features, we ask it to learn features.
- The neural net is the basic function used in deep learning.
- You start with a random neural network, feed it examples and you have it learn to recognize things.
- The deeper you get, the more sophisticated the features it can find are.
- What we’re going to learn is how neural networks do this automatically.
- This is the key difference in why we can now do things that we couldn’t previously conceive of as possible.
- An image recognizer can also be used to classify sounds (pictures of waveforms).
- Turning time series into pictures for image classification.
- fastai is built on top of PyTorch.
- !pip install -Uqq fastai to update.
- Always view your data at every step of building a model.
- For computer vision algorithms you don’t need particularly big images.
- For big images, most of the time is taken up opening them; the neural net on the GPU is much faster.
- The main thing you’re going to try and figure out is how do I get this data into my model?
- DataBlock blocks=(ImageBlock, CategoryBlock): ImageBlock is the type of input to the model, CategoryBlock is the type of model output.
- get_image_files(path) returns a list of all image files in a path.
- It’s critical that you put aside some data for testing the accuracy of your model (a validation set), using something like RandomSplitter for the splitter parameter.
- get_y tells fastai how to get the correct label for the photo.
- Most computer vision architectures need all of your inputs to be the same size; use Resize (which can either crop out a piece in the middle or squish the image) for the item_tfms parameter.
- DataLoaders contains iterators that PyTorch can run through to grab batches of your data to feed the training algorithm.
- show_batch shows you a batch of input/label pairs.
- A Learner combines a model (the actual neural network that we are training) and the data we use to train it with.
- PyTorch Image Models (timm).
- resnet has already been trained to recognize over 1 million images of over 1000 different types. fastai downloads this so you can start with a neural network that can do a lot.
- fine_tune takes those pretrained weights downloaded for you and adjusts them in a carefully controlled way to teach the model the differences between your dataset and what it was originally trained on.
- You pass .predict an image, which is how you would deploy your model; it returns whether it’s a bird or not as a string, an integer, and the probability of whether it’s a bird (in this example).
In the code blocks below, I’ll train the different types of models presented in the video lesson.
Image Segmentation
from fastai.vision.all import *
path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames=get_image_files(path/"images"),
    label_func=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes=np.loadtxt(path/'codes.txt', dtype=str)
)
learn = unet_learner(dls, resnet34)
learn.fine_tune(8)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 3.454409 | 3.015761 | 00:06 |
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 1.928762 | 1.719756 | 00:02 |
| 1 | 1.649520 | 1.394089 | 00:02 |
| 2 | 1.533350 | 1.344445 | 00:02 |
| 3 | 1.414438 | 1.279674 | 00:02 |
| 4 | 1.291168 | 1.063977 | 00:02 |
| 5 | 1.174492 | 0.980055 | 00:02 |
| 6 | 1.073124 | 0.931532 | 00:02 |
| 7 | 0.992161 | 0.922516 | 00:02 |
learn.show_results(max_n=3, figsize=(7,8))
It’s amazing how much it gets correct, given that this model was trained in about 24 seconds on a tiny amount of data.
I’ll take a look at the codes out of curiosity; it’s an array of strings describing the different objects in view.
np.loadtxt(path/'codes.txt', dtype=str)
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car',
'CartLuggagePram', 'Child', 'Column_Pole', 'Fence', 'LaneMkgsDriv',
'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving',
'ParkingBlock', 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk',
'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel',
'VegetationMisc', 'Void', 'Wall'], dtype='<U17')
Tabular Analysis
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names='salary',
    cat_names=['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[Categorify, FillMissing, Normalize])
dls.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Some-college | Divorced | Adm-clerical | Own-child | White | False | 42.0 | 138162.000499 | 10.0 | <50k |
| 1 | Private | HS-grad | Married-civ-spouse | Other-service | Husband | Asian-Pac-Islander | False | 40.0 | 73025.003080 | 9.0 | <50k |
| 2 | Private | Assoc-voc | Married-civ-spouse | Prof-specialty | Wife | White | False | 36.0 | 163396.000571 | 11.0 | >=50k |
| 3 | Private | HS-grad | Never-married | Sales | Own-child | White | False | 18.0 | 110141.999831 | 9.0 | <50k |
| 4 | Self-emp-not-inc | 12th | Divorced | Other-service | Unmarried | White | False | 28.0 | 33035.002716 | 8.0 | <50k |
| 5 | ? | 7th-8th | Separated | ? | Own-child | White | False | 50.0 | 346013.994175 | 4.0 | <50k |
| 6 | Self-emp-inc | HS-grad | Never-married | Farming-fishing | Not-in-family | White | False | 36.0 | 37018.999571 | 9.0 | <50k |
| 7 | State-gov | Masters | Married-civ-spouse | Prof-specialty | Husband | White | False | 37.0 | 239409.001471 | 14.0 | >=50k |
| 8 | Self-emp-not-inc | Doctorate | Married-civ-spouse | Prof-specialty | Husband | White | False | 50.0 | 167728.000009 | 16.0 | >=50k |
| 9 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 247111.001513 | 9.0 | >=50k |
For tabular models, there’s not generally going to be a pretrained model that already does something like what you want because every table of data is very different, so generally it doesn’t make too much sense to fine_tune a tabular model.
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(2)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.373780 | 0.365976 | 0.832770 | 00:06 |
| 1 | 0.356514 | 0.358780 | 0.833999 | 00:05 |
Collaborative Filtering
The basis of most recommendation systems.
from fastai.collab import *
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
dls.show_batch()
|  | userId | movieId | rating |
|---|---|---|---|
| 0 | 457 | 457 | 3.0 |
| 1 | 407 | 2959 | 5.0 |
| 2 | 294 | 356 | 4.0 |
| 3 | 78 | 356 | 5.0 |
| 4 | 596 | 3578 | 4.5 |
| 5 | 547 | 541 | 3.5 |
| 6 | 105 | 1193 | 4.0 |
| 7 | 176 | 4993 | 4.5 |
| 8 | 430 | 1214 | 4.0 |
| 9 | 607 | 858 | 4.5 |
There’s actually no pretrained collaborative filtering model so we could use fit_one_cycle but fine_tune works here as well.
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(10)
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 1.498450 | 1.417215 | 00:00 |
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 1.375927 | 1.357755 | 00:00 |
| 1 | 1.274781 | 1.176326 | 00:00 |
| 2 | 1.033917 | 0.870168 | 00:00 |
| 3 | 0.810119 | 0.719341 | 00:00 |
| 4 | 0.704180 | 0.679201 | 00:00 |
| 5 | 0.640635 | 0.667121 | 00:00 |
| 6 | 0.623741 | 0.661391 | 00:00 |
| 7 | 0.620811 | 0.657624 | 00:00 |
| 8 | 0.606947 | 0.656678 | 00:00 |
| 9 | 0.605081 | 0.656613 | 00:00 |
learn.show_results()
|  | userId | movieId | rating | rating_pred |
|---|---|---|---|---|
| 0 | 15.0 | 35.0 | 4.5 | 3.886339 |
| 1 | 68.0 | 64.0 | 5.0 | 3.822170 |
| 2 | 62.0 | 33.0 | 4.0 | 3.088149 |
| 3 | 39.0 | 91.0 | 4.0 | 3.788227 |
| 4 | 37.0 | 7.0 | 5.0 | 4.434169 |
| 5 | 38.0 | 98.0 | 3.5 | 4.380877 |
| 6 | 3.0 | 25.0 | 3.0 | 3.443295 |
| 7 | 23.0 | 13.0 | 2.0 | 3.220192 |
| 8 | 15.0 | 7.0 | 4.0 | 4.306846 |
Note: RISE turns your notebook into a presentation.
Generally speaking, if it’s something that a human can do reasonably quickly, even an expert human (like looking at a Go board and deciding whether it’s a good position), then it’s probably something deep learning will be good at. If it’s something that takes a logical thought process over time, particularly if it’s not based on much data, deep learning probably won’t do well.
The first neural network was built in 1957. The basic ideas have not changed much at all.
What’s going on in these models?
- Arthur Samuel in late 1950s invented Machine Learning.
- Normal program: input -> program -> results.
- Machine Learning model: input and weights (parameters) -> model -> results.
- The model is a mathematical function that takes the inputs, multiplies them by one set of weights and adds them up, then does that again for a second set of weights, and so forth.
- It takes all of the negative numbers and replaces them with 0.
- It takes all those numbers as inputs to the next layer.
- And it repeats a few times.
- Weights start out as being random.
- A more useful workflow: input/weights -> model -> results -> loss -> update weights.
- The loss is a number that says how good the results were.
- We need a way to come up with a new set of weights that are a bit better than the current weights.
- “bit better” weights means it makes the loss a bit better.
- If we make it a little bit better a few times, it’ll eventually get good.
- Neural nets are proven to be able to solve any computable function (i.e., they’re flexible enough that updating the weights can make the results good).
- “Generate artwork based on someone’s twitter bio” is a computable function.
- Once we’ve finished the training procedure, we don’t need the loss anymore, and the weights can be integrated into the model.
- We end up with inputs -> model -> results which looks like our original idea of a program.
- Deploying a model will have lots of tricky details, but there will be one line of code, learn.predict, which takes an input and provides results.
- The most important thing to do is experiment.
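The workflow in the notes above (inputs/weights -> model -> results -> loss -> update weights) can be sketched with a one-weight toy model in plain Python. This is an illustration of the idea only, not fastai's training code; the data and learning rate are made up:

```python
# Toy version of: inputs/weights -> model -> results -> loss -> update weights.
# Model: y = w * x, trained to match the target function y = 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3 * x for x in xs]

w = 0.0      # weights start out random/arbitrary
lr = 0.01    # learning rate controls how big each "bit better" step is

for step in range(200):
    preds = [w * x for x in xs]                                   # model -> results
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs) # how good?
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= lr * grad                                                # update weights

print(round(w, 2))  # 3.0 — the weight has learned the target function
```

Each pass makes the loss a little better, and repeating that a few hundred times is enough to recover the target weight.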
Book Notes
Chapter 1: Your Deep Learning Journey
In this section, I’ll take notes while I read Chapter 1 of the textbook.
Deep Learning is for Everyone
- What you don’t need for deep learning: lots of math, lots of data, lots of expensive computers.
- Deep learning is a computer technique to extract and transform data by using multiple layers of neural networks. Each of these layers takes its inputs from previous layers and progressively refines them. The layers are trained by algorithms that minimize their errors and improve their accuracy. In this way, the network learns to perform a specified task.
Neural Networks: A Brief History
- Warren McCulloch and Walter Pitts developed a mathematical model of an artificial neuron in 1943.
- Most of Pitts’s famous work was done while he was homeless.
- Psychologist Frank Rosenblatt further developed the artificial neuron to give it the ability to learn and built the first device that used these principles, the Mark I Perceptron, which was able to recognize simple shapes.
- Marvin Minsky and Seymour Papert wrote a book about the Perceptron and showed that using multiple layers of the devices would allow the limitations of a single layer to be addressed.
- The 1986 book Parallel Distributed Processing (PDP) by David Rumelhart, James McClelland, and the PDP Research Group defined PDP as requiring the following:
- A set of processing units.
- A state of activation.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network of connectivities.
- An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit.
- A learning rule whereby patterns of connectivity are modified by experience.
- An environment within which the system must operate.
How to Learn Deep Learning
- The hardest part of deep learning is artisanal: how do you know if you’ve got enough data, whether it is in the right format, if your model is training properly, and, if it’s not, what you should do about it?
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path,
    get_image_files(path),
    valid_pct=0.2,
    seed=42,
    label_func=is_cat,
    item_tfms=Resize(224)
)
dls.show_batch()
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
/usr/local/lib/python3.10/dist-packages/fastai/vision/learner.py:288: UserWarning: `cnn_learner` has been renamed to `vision_learner` -- please update your code
warn("`cnn_learner` has been renamed to `vision_learner` -- please update your code")
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|██████████| 83.3M/83.3M [00:00<00:00, 162MB/s]
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.140327 | 0.019135 | 0.007442 | 01:05 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.070464 | 0.024966 | 0.006766 | 01:00 |
The error rate is the proportion of images that were incorrectly identified.
Next, I’ll check that this model actually works with an image of a dog or a cat. I’ll download a picture from Google and use it for prediction:
import ipywidgets as widgets
uploader = widgets.FileUpload()
uploader
im = PILImage.create(uploader.data[0])
is_cat, _, probs = learn.predict(im)
im.to_thumb(256)
print(f'Is this a cat?: {is_cat}.')
print(f"Probability it's a cat: {probs[1].item():.6f}")
Is this a cat?: True.
Probability it's a cat: 1.000000
What is Machine Learning?
- A traditional program: inputs -> program -> results.
- In 1949, IBM researcher Arthur Samuel started working on machine learning. His basic idea was this: instead of telling the computer the exact steps required to solve a problem, show it examples of the problem to solve, and let it figure out how to solve it itself.
- In 1961 his checkers-playing program had learned so much that it beat the Connecticut state champion.
- Weights are just variables and a weight assignment is a particular choice of values for those variables.
- The program’s inputs are values that it processes in order to produce its results (for instance, taking image pixels as inputs, and returning the classification “dog” as a result).
- Because the weights affect the program, they are in a sense another kind of input.
- A program using weight assignment: inputs and weights -> model -> results.
- A model is a special kind of program, one that can do many different things depending on the weights.
- Weights = parameters, with the term “weights” reserved for a particular type of model parameter.
- Learning would become entirely automatic when the adjustment of the weights was also automatic.
- Training a machine learning model: inputs and weights -> model -> results -> performance -> update weights.
- The results are different from the performance of a model.
- Using a trained model as a program: inputs -> model -> results.
- Machine learning is the training of programs developed by allowing a computer to learn from its experience, rather than through manually coding the individual steps.
What is a Neural Network?
- A neural network is a mathematical function that can solve any problem to any level of accuracy.
- Stochastic Gradient Descent (SGD) is a completely general way to update the weights of a neural network, to make it improve at any given task.
- Image classification problem:
- Our inputs are the images.
- Our weights are the weights in the neural net.
- Our model is a neural net.
- Our results are the values that are calculated by the neural net, like “dog” or “cat”.
A Bit of Deep Learning Jargon
- The functional form of the model is called its architecture.
- The weights are called parameters.
- The predictions are calculated from the independent variable, which is the data not including the labels.
- The results of the model are called predictions.
- The measure of performance is called the loss.
- The loss depends not only on the predictions, but also on the correct labels (also known as targets or the dependent variable).
- Detailed training loop: inputs and parameters -> architecture -> predictions (+ labels) -> loss -> update parameters.
Limitations Inherent to Machine Learning
- A model cannot be created without data.
- A model can learn to operate on only the patterns seen in the input data used to train it.
- This learning approach creates only predictions, not recommended actions.
- It’s not enough to just have examples of input data, we need labels for that data too.
- Positive feedback loop: the more the model is used, the more biased the data becomes, making the model even more biased, and so forth.
How Our Image Recognizer Works
- item_tfms are applied to each item, while batch_tfms are applied to a batch of items at a time using the GPU.
- A classification model attempts to predict a class, or category.
- A regression model is one that attempts to predict one or more numeric quantities, such as temperature or location.
- The parameter seed=42 sets the random seed to the same value every time we run this code, which means we get the same validation set every time. This way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not to having a different random validation set.
- We care about how well our model works on previously unseen images.
- The longer you train for, the better your accuracy will get on the training set; the validation set accuracy will also improve for a while, but eventually it will start getting worse as the model starts to memorize the training set rather than finding generalizable underlying patterns in the data. When this happens, we say that the model is overfitting.
- Overfitting is the single most important and challenging issue when training, for all machine learning practitioners and all algorithms.
- You should only use methods to avoid overfitting after you have confirmed that overfitting is occurring (i.e., if you have observed the validation accuracy getting worse during training)
- fastai defaults to valid_pct=0.2.
- Models using architectures with more layers take longer to train and are more prone to overfitting; on the other hand, with more data they can be quite a bit more accurate.
- A metric is a function that measures the quality of the model’s predictions using the validation set.
- error_rate tells you what percentage of inputs in the validation set are being classified incorrectly.
- accuracy = 1.0 - error_rate.
- The entire purpose of loss is to define a “measure of performance” that the training system can use to update weights automatically. A good choice for loss is one that is easy for stochastic gradient descent to use. But a metric is defined for human consumption, so a good metric is one that is easy for you to understand.
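The distinction between loss and metric can be seen with a few toy numbers: error_rate only changes when a prediction flips class, while a cross-entropy-style loss moves smoothly with the model's confidence. The labels and probabilities below are made up for illustration:

```python
# Loss vs. metric on toy numbers. error_rate is chunky (only changes when
# a prediction flips class); the loss changes smoothly with confidence.
import math

targets = [1, 0, 1, 1]           # 1 = cat, 0 = not-cat (made-up labels)
probs   = [0.9, 0.2, 0.6, 0.4]   # model's predicted P(cat), made up

preds = [1 if p >= 0.5 else 0 for p in probs]
error_rate = sum(p != t for p, t in zip(preds, targets)) / len(targets)
accuracy = 1.0 - error_rate

# Cross-entropy-style loss: rewards confident correct predictions smoothly.
loss = -sum(math.log(p) if t == 1 else math.log(1 - p)
            for p, t in zip(probs, targets)) / len(targets)

print(error_rate, accuracy)  # 0.25 0.75
print(round(loss, 3))
```

Nudging the last probability from 0.4 to 0.45 would lower the loss (so SGD has a gradient to follow) while leaving the error rate unchanged, which is exactly why training uses loss and humans read metrics.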
- A model that has weights that have already been trained on another dataset is called a pretrained model.
- When using a pretrained model, cnn_learner will remove the last layer and replace it with one or more new layers with randomized weights. This last part of the model is known as the head.
- Using a pretrained model for a task different from what it was originally trained for is known as transfer learning.
- The architecture only describes a template for a mathematical function; it doesn’t actually do anything until we provide values for the millions of parameters it contains.
- To fit a model, we have to provide at least one piece of information: how many times to look at each image (known as the number of epochs).
- fit will fit a model (i.e., look at images in the training set multiple times, each time updating the parameters to make the predictions closer and closer to the target labels).
- Fine-tuning: a transfer learning technique that updates the parameters of a pretrained model by training for additional epochs using a different task from that used for pretraining.
- fine_tune has a few parameters you can set, but in its default form it does two steps:
  - Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset.
  - Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which don’t require many changes from the pretrained weights).
- The head of the model is the part that is newly added to be specific to the new dataset.
- An epoch is one complete pass through the dataset.
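The `error_rate`/`accuracy` relationship above is easy to sanity-check by hand. A minimal stdlib sketch (not fastai's implementation; the prediction and label lists are made up):

```python
# Toy illustration: error_rate is the fraction of validation predictions
# that miss their labels, and accuracy is its complement.
preds  = ["bird", "forest", "bird", "bird", "forest"]
labels = ["bird", "forest", "forest", "bird", "forest"]

errors = sum(p != y for p, y in zip(preds, labels))
error_rate = errors / len(labels)
accuracy = 1.0 - error_rate

print(error_rate, accuracy)  # 0.2 0.8
```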
What Our Image Recognizer Learned
- When we fine tune our pretrained models, we adapt what the last layers focus on to specialize on the problem at hand.
Image Recognizers Can Tackle Non-Image Tasks
- A lot of things can be represented as images.
- Sound can be converted to a spectrogram.
- Time series data can be converted into an image using a Gramian Angular Difference Field (GADF).
- If the human eye can recognize categories from the images, then a deep learning model should be able to do so too.
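As a sketch of the GADF idea mentioned above: rescale the series to [-1, 1], treat each value's arccos as an angle, and build the matrix of sine differences. This is a from-scratch illustration with an invented sample series; real projects would typically use a library such as pyts:

```python
import math

# Gramian Angular Difference Field sketch: GADF[i][j] = sin(phi_i - phi_j),
# where phi is the arccos of the series rescaled into [-1, 1].
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]  # made-up time series

lo, hi = min(series), max(series)
scaled = [2 * (v - lo) / (hi - lo) - 1 for v in series]  # map to [-1, 1]
phi = [math.acos(v) for v in scaled]                      # angular encoding

gadf = [[math.sin(pi - pj) for pj in phi] for pi in phi]
# The diagonal is zero (sin of 0) and the matrix is antisymmetric,
# which is what gives the "image" its visual structure.
```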
Jargon Recap
| Term | Meaning |
|---|---|
| Label | The data that we’re trying to predict |
| Architecture | The template of the model that we’re trying to fit; i.e., the actual mathematical function that we’re passing the input data and parameters to |
| Model | The combination of the architecture with a particular set of parameters |
| Parameters | The values in the model that change what task it can do and that are updated through model training |
| Fit | Update the parameters of the model such that the predictions of the model using the input data match the target labels |
| Train | A synonym for fit |
| Pretrained Model | A model that has already been trained, generally using a large dataset, and will be fine-tuned |
| Fine-tune | Update a pretrained model for a different task |
| Epoch | One complete pass through the input data |
| Loss | A measure of how good the model is, chosen to drive training via SGD |
| Metric | A measurement of how good the model is using the validation set, chosen for human consumption |
| Validation set | A set of data held out from training, used only for measuring how good the model is |
| Training set | The data used for fitting the model; does not include any data from the validation set |
| Overfitting | Training a model in such a way that it remembers specific features of the input data, rather than generalizing well to data not seen during training |
| CNN | Convolutional neural network; a type of neural network that works particularly well for computer vision tasks |
Deep Learning is Not Just for Image Classification
- Segmentation
- Natural language processing (see below)
- Tabular (see Adults income classification above)
- Collaborative filtering (see MovieLens ratings predictor above)
- Start by using one of the cut-down dataset versions and later scale up to the full-size version. This is how the world’s top practitioners do their modeling in practice; they do most of their experimentation and prototyping with subsets of their data, and use the full dataset only when they have a good understanding of what they have to do.
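The prototype-on-a-subset workflow above can be as simple as sampling a fraction of your items before building DataLoaders. A hedged stdlib sketch (the dataset and subset size are placeholders):

```python
import random

# Prototyping sketch: iterate quickly on a small random subset first,
# then scale up to the full dataset once the pipeline works.
random.seed(42)
full_dataset = list(range(10_000))          # stand-in for real samples
subset = random.sample(full_dataset, 500)   # 5% for quick experiments
print(len(subset))  # 500
```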
Validation Sets and Test Sets
- If the model makes an accurate prediction for a data item, that should be because it has learned characteristics of that kind of item, and not because the model has been shaped by actually having seen that particular item.
- Hyperparameters: various modeling choices regarding network architecture, learning rates, data augmentation strategies, and other factors.
- We, as modelers, evaluate the model by looking at predictions on the validation data when deciding which hyperparameter values to explore next, so we are in danger of overfitting the validation data through human trial and error and exploration.
- The test set can be used only to evaluate the model at the very end of our efforts.
- Training data is fully exposed to training and modeling processes, validation data is less exposed and test data is fully hidden.
- The test and validation sets should have enough data to ensure that you get a good estimate of your accuracy.
- The discipline of the test set helps us keep ourselves intellectually honest.
- It’s a good idea for you to try out a simple baseline model yourself, so you know what a really simple model can achieve.
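A simple baseline, as suggested above, can be as crude as always predicting the most common training label; anything you build should beat that number. A stdlib sketch with made-up labels:

```python
from collections import Counter

# "Majority class" baseline: always predict the most frequent label
# seen in training, then score that constant guess on the validation set.
train_labels = ["negative", "positive", "negative", "negative", "positive"]
valid_labels = ["negative", "positive", "negative", "negative"]

majority = Counter(train_labels).most_common(1)[0][0]
baseline_acc = sum(y == majority for y in valid_labels) / len(valid_labels)
print(majority, baseline_acc)  # negative 0.75
```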
Use Judgment in Defining Test Sets
- A key property of the validation and test sets is that they must be representative of the new data you will see in the future.
- As an example, for time series data, use earlier dates for the training set and the most recent dates for the validation set
- The data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.
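The time-series example above can be sketched in a few lines: sort by date and hold out the most recent slice rather than a random sample (the dates and values below are invented):

```python
# Time-aware split sketch: validation data comes strictly after the
# training data, mimicking how the model will be used in the future.
rows = [
    ("2023-01-05", 10),
    ("2023-03-17", 12),
    ("2023-02-02", 11),
    ("2023-04-21", 13),
    ("2023-05-30", 14),
]

rows.sort(key=lambda r: r[0])        # ISO dates sort correctly as strings
cut = int(len(rows) * 0.8)           # keep the earliest 80% for training
train, valid = rows[:cut], rows[cut:]
# Every validation date is later than every training date.
```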
from fastai.text.all import *
# I'm using IMDB_SAMPLE instead of the full IMDB dataset since it either takes too long or
# I get a CUDA Out of Memory error if the batch size is more than 16 for the full dataset
# Using a batch size of 16 with the sample dataset works fast
dls = TextDataLoaders.from_csv(
path=untar_data(URLs.IMDB_SAMPLE),
csv_fname='texts.csv',
text_col=1,
label_col=0,
bs=16)
dls.show_batch()
|  | text | category |
|---|---|---|
| 0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
| 1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies | positive |
| 2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after 20 - odd years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , | negative |
| 3 | xxbos i really wanted to love this show . i truly , honestly did . \n\n xxmaj for the first time , gay viewers get their own version of the " the xxmaj bachelor " . xxmaj with the help of his obligatory " hag " xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance of love with 15 suitors ( or " mates " as they are referred to in the show ) . xxmaj the only problem is half of them are straight and xxmaj james does n't know this . xxmaj if xxmaj james picks a gay one , they get a trip to xxmaj new xxmaj zealand , and xxmaj if he picks a straight one , straight guy gets $ 25 , xxrep 3 0 . xxmaj how can this not be fun | negative |
| 4 | xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics that are terribly dated today , the game xxunk you into the role of xxunk even * think * xxmaj i 'm going to attempt spelling his last name ! ) , an xxmaj american xxup xxunk . caught in an underground bunker . xxmaj you fight and search your way through xxunk in order to achieve different xxunk for the six xxunk , let 's face it , most of them are just an excuse to hand you a weapon | positive |
| 5 | xxbos xxmaj i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie " october xxmaj sky " ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj hickam 's memoirs , but i am still able to enjoy and understand their film adaptation . xxmaj the film , directed by xxmaj joe xxmaj xxunk and written by xxmaj lewis xxmaj xxunk , xxunk the story of teenager xxmaj homer xxmaj hickam ( jake xxmaj xxunk ) , beginning in xxmaj october of 1957 . xxmaj it opens with the sound of a radio broadcast , bringing news of the xxmaj russian satellite xxmaj xxunk , the first artificial satellite in | positive |
| 6 | xxbos xxmaj to review this movie , i without any doubt would have to quote that memorable scene in xxmaj tarantino 's " pulp xxmaj fiction " ( xxunk ) when xxmaj jules and xxmaj vincent are talking about xxmaj mia xxmaj wallace and what she does for a living . xxmaj jules tells xxmaj vincent that the " only thing she did worthwhile was pilot " . xxmaj vincent asks " what the hell is a pilot ? " and xxmaj jules goes into a very well description of what a xxup tv pilot is : " well , the way they make shows is , they make one show . xxmaj that show 's called a ' pilot ' . xxmaj then they show that show to the people who make shows , and on the strength of that one show they decide if they 're going to | negative |
| 7 | xxbos xxmaj how viewers react to this new " adaption " of xxmaj shirley xxmaj jackson 's book , which was promoted as xxup not being a remake of the original 1963 movie ( true enough ) , will be based , i suspect , on the following : those who were big fans of either the book or original movie are not going to think much of this one … and those who have never been exposed to either , and who are big fans of xxmaj hollywood 's current trend towards " special effects " being the first and last word in how " good " a film is , are going to love it . \n\n xxmaj things i did not like about this adaption : \n\n 1 . xxmaj it was xxup not a true adaption of the book . xxmaj from the xxunk i had | negative |
| 8 | xxbos xxmaj the trouble with the book , " memoirs of a xxmaj geisha " is that it had xxmaj japanese xxunk but underneath the xxunk it was all an xxmaj american man 's way of thinking . xxmaj reading the book is like watching a magnificent ballet with great music , sets , and costumes yet performed by xxunk animals dressed in those xxunk far from xxmaj japanese ways of thinking were the characters . \n\n xxmaj the movie is n't about xxmaj japan or real geisha . xxmaj it is a story about a few xxmaj american men 's mistaken ideas about xxmaj japan and geisha xxunk through their own ignorance and misconceptions . xxmaj so what is this movie if it is n't about xxmaj japan or geisha ? xxmaj is it pure fantasy as so many people have said ? xxmaj yes , but then why | negative |
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.629276 | 0.553454 | 0.740000 | 00:19 |
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.466581 | 0.548400 | 0.740000 | 00:30 |
| 1 | 0.410401 | 0.418941 | 0.825000 | 00:30 |
| 2 | 0.286162 | 0.410872 | 0.830000 | 00:31 |
| 3 | 0.192047 | 0.405275 | 0.845000 | 00:31 |
# view actual vs prediction
learn.show_results()
|  | text | category | category_ |
|---|---|---|---|
| 0 | xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' " la xxmaj xxunk , " based on a play by xxmaj arthur xxmaj xxunk , who is given an " inspired by " credit . xxmaj it starts from one person , a prostitute , standing on a street corner in xxmaj brooklyn . xxmaj she is picked up by a home contractor , who has sex with her on the hood of a car , but ca n't come . xxmaj he refuses to pay her . xxmaj when he 's off xxunk , she | positive | positive |
| 1 | xxbos xxmaj bonanza had a great cast of wonderful actors . xxmaj xxunk xxmaj xxunk , xxmaj pernell xxmaj whitaker , xxmaj michael xxmaj xxunk , xxmaj dan xxmaj blocker , and even xxmaj guy xxmaj williams ( as the cousin who was brought in for several episodes during 1964 to replace xxmaj adam when he was leaving the series ) . xxmaj the cast had chemistry , and they seemed to genuinely like each other . xxmaj that made many of their weakest stories work a lot better than they should have . xxmaj it also made many of their best stories into great western drama . \n\n xxmaj like any show that was shooting over thirty episodes every season , there are bound to be some weak ones . xxmaj however , most of the time each episode had an interesting story , some kind of conflict , | positive | negative |
| 2 | xxbos i watched xxmaj grendel the other night and am compelled to put together a xxmaj public xxmaj service xxmaj announcement . \n\n xxmaj grendel is another version of xxmaj beowulf , the thousand - year - old xxunk - saxon epic poem . xxmaj the scifi channel has a growing catalog of xxunk and uninteresting movies , and the previews promised an xxunk low - budget mini - epic , but this one xxunk to let me switch xxunk . xxmaj it was xxunk , xxunk , bad . i watched in xxunk and horror at the train wreck you could n't tear your eyes away from . i reached for a xxunk and managed to capture part of what i was seeing . xxmaj the following may contain spoilers or might just save your xxunk . xxmaj you 've been warned . \n\n - xxmaj just to get | negative | negative |
| 3 | xxbos xxmaj this is the last of four xxunk from xxmaj france xxmaj i 've xxunk for viewing during this xxmaj christmas season : the others ( in order of viewing ) were the uninspired xxup the xxup black xxup tulip ( 1964 ; from the same director as this one but not nearly as good ) , the surprisingly effective xxup lady xxmaj oscar ( 1979 ; which had xxunk as a xxmaj japanese manga ! ) and the splendid xxup cartouche ( xxunk ) . xxmaj actually , i had watched this one not too long ago on late - night xxmaj italian xxup tv and recall not being especially xxunk over by it , so that i was genuinely surprised by how much i enjoyed it this time around ( also bearing in mind the xxunk lack of enthusiasm shown towards the film here and elsewhere when | positive | positive |
| 4 | xxbos xxmaj this is not really a zombie film , if we 're xxunk zombies as the dead walking around . xxmaj here the protagonist , xxmaj xxunk xxmaj louque ( played by an unbelievably young xxmaj dean xxmaj xxunk ) , xxunk control of a method to create zombies , though in fact , his ' method ' is to mentally project his thoughts and control other living people 's minds turning them into hypnotized slaves . xxmaj this is an interesting concept for a movie , and was done much more effectively by xxmaj xxunk xxmaj lang in his series of ' dr . xxmaj mabuse ' films , including ' dr . xxmaj mabuse the xxmaj xxunk ' ( 1922 ) and ' the xxmaj testament of xxmaj dr . xxmaj mabuse ' ( 1933 ) . xxmaj here it is unfortunately xxunk to his quest to | negative | positive |
| 5 | xxbos " once upon a time there was a charming land called xxmaj france … . xxmaj people lived happily then . xxmaj the women were easy and the men xxunk in their favorite xxunk : war , the only xxunk of xxunk which the people could enjoy . " xxmaj the war in question was the xxmaj seven xxmaj year 's xxmaj war , and when it was noticed that there were more xxunk of soldiers than soldiers , xxunk were sent out to xxunk the ranks . \n\n xxmaj and so it was that xxmaj fanfan ( gerard xxmaj philipe ) , caught xxunk a farmer 's daughter in a pile of hay , escapes marriage by xxunk in the xxmaj xxunk xxunk … but only by first believing his future as xxunk by a gypsy , that he will win fame and fortune in xxmaj his xxmaj | positive | positive |
| 6 | xxbos xxup ok , let me again admit that i have n't seen any other xxmaj xxunk xxmaj ivory ( the xxunk ) films . xxmaj nor have i seen more celebrated works by the director , so my capacity to xxunk xxmaj before the xxmaj rains outside of analysis of the film itself is xxunk . xxmaj with that xxunk , let me begin . \n\n xxmaj before the xxmaj rains is a different kind of movie that does n't know which genre it wants to be . xxmaj at first , it pretends to be a romance . xxmaj in most romances , the protagonist falls in love with a supporting character , is separated from the supporting character , and is ( sometimes ) united with his or her partner . xxmaj this movie 's hero has already won the heart of his lover but can not | negative | negative |
| 7 | xxbos xxmaj first off , anyone looking for meaningful " outcome xxunk " cinema that packs some sort of social message with meaningful performances and soul searching dialog spoken by dedicated , xxunk , heartfelt xxunk , please leave now . xxmaj you are wasting your time and life is short , go see the new xxmaj xxunk xxmaj jolie movie , have a good cry , go out & buy a xxunk car or throw away your conflict xxunk if that will make you feel better , and leave us alone . \n\n xxmaj do n't let the door hit you on the way out either . xxup the xxup incredible xxup melting xxup man is a grade b minus xxunk horror epic shot in the xxunk of xxmaj oklahoma by a young , xxup tv friendly cast & crew , and concerns itself with an astronaut who is | positive | negative |
| 8 | xxbos " national xxmaj treasure " ( 2004 ) is a thoroughly misguided xxunk - xxunk of plot xxunk that borrow from nearly every xxunk and dagger government conspiracy cliché that has ever been written . xxmaj the film stars xxmaj nicholas xxmaj cage as xxmaj benjamin xxmaj xxunk xxmaj xxunk ( how precious is that , i ask you ? ) ; a seemingly normal fellow who , for no other reason than being of a xxunk of like - minded misguided fortune hunters , decides to steal a ' national treasure ' that has been hidden by the xxmaj united xxmaj states xxunk fathers . xxmaj after a bit of subtext and background that plays laughably ( unintentionally ) like xxmaj indiana xxmaj jones meets xxmaj the xxmaj patriot , the film xxunk into one misguided xxunk after another attempting to create a ' stanley xxmaj xxunk | negative | negative |
review_text = "I really liked the movie!"
learn.predict(review_text)
('positive', tensor(1), tensor([0.0174, 0.9826]))
Questionnaire
- Do you need these for deep learning?
- Lots of Math (FALSE).
- Lots of Data (FALSE).
- Lots of expensive computers (FALSE).
- A PhD (FALSE).
- Name five areas where deep learning is now the best tool in the world
- Natural Language Processing (NLP).
- Computer vision.
- Medicine.
- Image generation.
- Recommendation systems.
- What was the name of the first device that was based on the principle of the artificial neuron?
- Mark I Perceptron.
- Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
- A series of processing units.
- A state of activation.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network of connectivities.
- An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit.
- A learning rule whereby patterns of connectivity are modified by experience.
- An environment within which the system must operate.
- What were the two theoretical misunderstandings that held back the field of neural networks?
- Using multiple layers of the device would allow limitations of one layer to be addressed—this was ignored.
- More than two layers are needed to get practical, good performance—only in the last decade has this been more widely appreciated and applied.
- What is a GPU?
- A Graphical Processing Unit, which can perform thousands of tasks at the same time.
- Open a notebook and execute a cell containing: `1+1`. What happens?
- Depending on the server, it may take some time for the output to generate, but running this cell will output `2`.
- Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.
- (I did this for the notebook shared for Lesson 1).
- Complete the Jupyter Notebook online appendix.
- Done. Will reference some of it again.
- Why is it hard to use a traditional computer program to recognize images in a photo?
- Because it’s hard to give a computer clear, step-by-step instructions for recognizing what is in an image.
- What did Samuel mean by “weight assignment”?
- A particular choice for weights (variables)
- What term do we normally use in deep learning for what Samuel called “weights”?
- Parameters
- Draw a picture that summarizes Samuel’s view of a machine learning model
- input and weights -> model -> results -> performance -> update weights/inputs
- Why is it hard to understand why a deep learning model makes a particular prediction?
- Because a deep learning model has many layers and connectivities and activations between neurons that are not intuitive to our understanding.
- What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?
- Universal approximation theorem.
- What do you need in order to train a model?
- Labeled data (Inputs and targets).
- Architecture.
- Initial weights.
- A measure of performance (loss, accuracy).
- A way to update the model (SGD).
- How could a feedback loop impact the rollout of a predictive policing model?
- The model will end up predicting where arrests are made, not where crime is taking place, so more police officers will go to locations where more arrests are predicted and feed that data back to the model which will reinforce the prediction of arrests in those areas, continuing this feedback loop of predictions -> arrests -> predictions.
- Do we always have to use 224x224-pixel images with the cat recognition model?
- No, that’s just the convention for image recognition models.
- You can use larger images but it will slow down the training process (it takes longer to open up bigger images).
- What is the difference between classification and regression?
- Classification predicts discrete classes or categories.
- Regression predicts continuous values.
- What is a validation set? What is a test set? Why do we need them?
- A validation set is a dataset upon which a model’s accuracy (or metrics in general) is calculated during training, as well as the dataset upon which the performance of different hyperparameters (like batch size and learning rate) are measured.
- A test set is a dataset upon which a model’s final performance is measured, a truly unseen dataset for both the model and the practitioner
- What will fastai do if you don’t provide a validation set?
- Set aside a random 20% of the data as the validation set by default
- Can we always use a random sample for a validation set? Why or why not?
- No, in situations where we want to ensure that the model’s accuracy is evaluated on data the model has not seen, we should not use a random validation set. Instead, we should create an intentional validation set. For example:
- For time series data, use the most recent dates as the validation set
- For human recognition data, use images of different people for training and validation sets
- What is overfitting? Provide an example.
- Overfitting is when a model memorizes features of the training dataset instead of learning generalizations of the features in the data. An example of this is when a model memorizes training data facial features but then cannot recognize different faces in the real world. Another example is when a model memorizes the handwritten digits in the training data, so it cannot then recognize digits written in different handwriting. Overfitting can be observed during training when the validation loss starts to increase as the training loss decreases.
- What is a metric? How does it differ from loss?
- A metric is a measurement of how well a model is performing, chosen for human consumption. A loss is also a measurement of how well a model is performing, but it’s chosen to drive training via an optimizer.
- How can pretrained models help?
- Pretrained models are already good at recognizing many general features, so they can help by providing a capable set of starting weights for an architecture, reducing the amount of time you need to train a model for your specific task.
- What is the “head” of the model?
- The last/top few neural network layers which are replaced with randomized weights in order to specialize your model via training on the task at hand (and not the task it was pretrained to perform).
- What kinds of features do the early layers of a CNN find? How about the later layers?
- Early layers: simple features like lines and color gradients
- Later layers: complex features like dog faces or outlines of people
- Are image models useful only for photos?
- No! Lots of things can be represented by images, so if you can represent something (like a sound) as an image (a spectrogram) and differences between classes/categories are easily recognizable by the human eye, you can train an image classifier to recognize it.
- What is an architecture?
- A template mathematical function to which you pass input data in order to fit/train a model
- What is segmentation?
- Classifying every pixel of an image according to the object it belongs to (in the output mask, each object category gets its own pixel color)
- What is `y_range` used for? When do we need it?
- It’s used to specify the output range of a regression model. We need it when the target is a continuous value.
- What are hyperparameters?
- Modeling choices such as network architecture, learning rates, data augmentation strategies and other higher level choices that govern the meaning of the weight parameters.
- What is the best way to avoid failures when using AI in an organization?
- Making sure you have good validation and test sets to evaluate the performance of a model on real world data.
- Trying out a simple baseline model to know what level of performance such a model can achieve.
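One of the answers above describes the overfitting signal: validation loss starts rising while training loss keeps falling. A toy sketch with invented loss curves showing how to spot it numerically:

```python
# Invented loss curves for six epochs. Training loss keeps improving,
# but validation loss bottoms out at epoch 3 and then climbs: overfitting.
train_loss = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
valid_loss = [0.95, 0.70, 0.55, 0.50, 0.53, 0.60]

best_epoch = valid_loss.index(min(valid_loss))
overfitting = valid_loss[-1] > min(valid_loss) and train_loss[-1] < train_loss[best_epoch]
print(best_epoch, overfitting)  # 3 True
```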
Further Research
- Why is a GPU useful for deep learning? How is a CPU different, and why is it less effective for deep learning?
- CPU vs GPU for Machine Learning
- CPUs process tasks in a sequential manner, GPUs process tasks in parallel.
- GPUs can have thousands of cores, processing tasks at the same time.
- GPUs have many cores processing at low speeds, CPUs have few cores processing at high speeds.
- Some algorithms are optimized for CPUs rather than GPUs (time series data, recommendation systems that need lots of memory).
- Neural networks are designed to process tasks in parallel.
- CPU vs GPU in Machine Learning Algorithms: Which is Better?
- Machine Learning Operations Preferred on CPUs
- Recommendation systems that involve huge memory for embedding layers.
- Support vector machines, time-series data, algorithms that don’t require parallel computing.
- Recurrent neural networks because they use sequential data.
- Algorithms with intensive branching.
- Machine Learning Operations Preferred on GPUs
- Operations that involve parallelism.
- Why Deep Learning Uses GPUs
- Neural networks are specifically made for running in parallel.
- Try to think of three areas where feedback loops might impact the use of machine learning. See if you can find documented examples of that happening in practice.
- Hidden Risks of Machine Learning Applied to Healthcare: Unintended Feedback Loops Between Models and Future Data Causing Model Degradation
- If clinicians fully trust the machine learning model (100% adoption of the predicted label) the false positive rate (FPR) grows uncontrollably with the number of updates.
- Runaway Feedback Loops in Predictive Policing
- Once police are deployed based on these predictions, data from observations in the neighborhood is then used to further update the model.
- Discovered crime data (e.g., arrest counts) are used to help update the model, and the process is repeated.
- Predictive policing systems have been empirically shown to be susceptible to runaway feedback loops, where police are repeatedly sent back to the same neighborhoods regardless of the true crime rate.
- Pitfalls of Predictive Policing: An Ethical Analysis
- Predictive policing relies on a large database of previous crime data and forecasts where crime is likely to occur. Since the program relies on old data, those previous arrests need to be unbiased to generate unbiased forecasts.
- People of color are arrested far more often than white people for committing the same crime.
- Racially biased arrest data creates biased forecasts in neighborhoods where more people of color are arrested.
- If the predictive policing algorithm is using biased data to divert more police forces towards less affluent neighborhoods and neighborhoods of color, then those neighborhoods are not receiving the same treatment as others.
- Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say
- The algorithm COMPAS, which predicts whether a person is “high-risk” (deemed more likely to be arrested in the future), can lead to people being imprisoned (instead of sent to rehab) or given longer sentences.
- Can bots discriminate? It’s a big question as companies use AI for hiring
- If an older candidate makes it past the resume screening process but gets confused by or interacts poorly with the chatbot, that data could teach the algorithm that candidates with similar profiles should be ranked lower.
- Echo chambers, rabbit holes, and ideological bias: How YouTube recommends content to real users
- We find that YouTube’s algorithm pushes real users into (very) mild ideological echo chambers.
- We found that 14 out of 527 (~3%) of our users ended up in rabbit holes.
- Finally, we found that, regardless of the ideology of the study participant, the algorithm pushes all users in a moderately conservative direction.
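The runaway feedback loop described in these readings can be reproduced in a toy simulation: two districts with identical underlying crime, a tiny historical bias in recorded arrests, and a policy of always patrolling the "hotter" district (all numbers are invented):

```python
# Toy feedback-loop simulation. Both districts have the SAME true crime
# level, but district 0 starts with one extra recorded arrest. The model
# deploys all patrols to the district with more recorded arrests, and
# only patrolled districts generate new arrest records.
arrests_per_round = [10, 10]   # identical underlying crime levels
recorded = [11, 10]            # tiny historical bias toward district 0

for _ in range(20):
    target = recorded.index(max(recorded))      # patrol the "hotter" district
    recorded[target] += arrests_per_round[target]
    # The unpatrolled district records nothing, so the gap only widens.

print(recorded)  # [211, 10] -- district 0 runs away despite equal crime
```

Because the data fed back to the model is a function of where the model sent patrols, the initial one-arrest difference compounds indefinitely; this is the "predictions -> arrests -> predictions" loop from the questionnaire answer above.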
Lesson 2: Deployment
I’m going to do things a bit differently than how I approached Lesson 1. Jeremy suggested that we first watch the video without pausing in order to understand what we’re going to do, and then watch it a second time and follow along. I also want to be mindful of how long I’m running my Paperspace Gradient machine (at $0.51/hour) so that I don’t run the machine when I don’t need its GPU.
So, here’s how I’m going to approach Lesson 2:
- Read the Chapter 2 Questionnaire so I know what I’ll be “tested” on at the end
- Watch the video without taking notes or running code
- Rewatch the video and take notes in this notebook
- Add the Kaggle code cells to this notebook and run them in Paperspace
- Read the Gradio tutorial without running code
- Re-read the Gradio tutorial and follow along with my own code
- Read Chapter 2 in the textbook and run code in this notebook in Paperspace
- Read Chapter 2 in the textbook and take notes in this notebook (including answers to the Questionnaire)
With this approach, I’ll have a big picture understanding of each step of the lesson and I’ll minimize the time I’m spending running my Paperspace Gradient machine.
Video Notes
- In this lesson we’re doing things that haven’t been done in courses like this before.
- Resource: aiquizzes.com—I signed up and answered a couple of questions.
- Don’t forget the FastAI Forums
- Click “Summarize this Topic” to get a list of the most upvoted posts
- How do we go about putting a model in production?
- Figure out what problem you want to solve
- Figure out how to get data for it
- Gather some data
- Use DuckDuckGo image function
- Download data
- Get rid of images that failed to open
- Data cleaning
- Before you clean your data, train the model
- `ImageClassifierCleaner` can be used to clean (delete or re-label) wrongly labeled data in the dataset.
- The cleaner orders by loss, so you only need to look at the first few.
- Always build a model to find out what things are difficult to recognize in your data and to find the things the model can help you find that are problems in the data
- Train your model again
- Deploy to HuggingFace Spaces
- Install Jupyter Notebook Extensions to get features like table of contents and collapsible sections (with which you can also navigate sections using arrow keys)
- Type `??` followed by a function name to get its source code.
- Type `?` followed by a function name to get brief info.
- If you have nbdev installed, `doc(<fn>)` will give you a link to the documentation.
- Different ways to resize an image:
- `ResizeMethod.Squish` (see the whole picture, with a distorted aspect ratio)
- `ResizeMethod.Pad` (whole image in the correct aspect ratio)
- Data Augmentation
- `RandomResizedCrop` (a different bit of the image every time)
- `batch_tfms=aug_transforms()` (images get turned, squished, warped, saturated, recolored, etc.)
- Use if you are training for more than 5-10 epochs.
- The image is resized/cropped/etc. in memory, in real time.
- Confusion matrix (`ClassificationInterpretation`)
- Only meaningful for category labels.
- Shows what category errors your model is making (actual vs predicted).
- In a lot of situations this will let you know what the hard categories to classify are (e.g. breeds of pets that are hard to identify).
- `.plot_top_losses` tells us where the loss is the highest (prediction/actual/loss/probability).
- A loss will be bad (high) if we are wrong + confident, or right + unconfident.
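The “wrong + confident” intuition can be checked by hand. Assuming the loss behaves like the negative log likelihood of the true class (the cross-entropy family used for classification), here is a plain-Python sketch:

```python
import math

def nll(prob_of_true_class):
    """Negative log likelihood: the lower the probability the model
    assigns to the right answer, the higher the loss."""
    return -math.log(prob_of_true_class)

print(nll(0.99))  # right + confident: near-zero loss
print(nll(0.55))  # right + unconfident: moderate loss
print(nll(0.01))  # wrong + confident (1% on the true class): large loss
```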
- On your computer, normal RAM doesn’t get filled up because the operating system swaps RAM out to the hard disk. GPUs don’t do swapping, so do only one thing at a time so you don’t use up all the GPU memory.
- Gradio + HuggingFace Spaces
- Here is my Hello World HuggingFace Space!
- Next, we’ll put a deep learning model in production. In the code cells below, I will train and export a dog vs cat classifier.
# import all the stuff we need from fastai
from fastai.vision.all import *
from fastbook import *

# download and decompress our dataset
path = untar_data(URLs.PETS)/'images'

# define a function to label our images
def is_cat(x): return x[0].isupper()

# create `DataLoaders`
dls = ImageDataLoaders.from_name_func('.',
    get_image_files(path),
    valid_pct = 0.2,
    seed = 42,
    label_func = is_cat,
    item_tfms = Resize(192))

# view batch
dls.show_batch()
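The `is_cat` rule works because of the Pets dataset’s naming convention: cat breed filenames start with an uppercase letter while dog breeds are lowercase (the filenames below are just illustrative examples). A quick check:

```python
# label rule from the cell above: uppercase first letter means "cat"
def is_cat(x): return x[0].isupper()

print(is_cat('Bengal_101.jpg'))  # True  -> cat breed
print(is_cat('beagle_32.jpg'))   # False -> dog breed
```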
# train our model using resnet18 to keep it small and fast
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.199976 | 0.072374 | 0.020298 | 00:19 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.061802 | 0.081512 | 0.020974 | 00:20 |
| 1 | 0.047748 | 0.030506 | 0.010149 | 00:18 |
| 2 | 0.021600 | 0.026245 | 0.006766 | 00:18 |
# export our trained learner
learn.export('model.pkl')
- Following the script in the video, as well as the `git-lfs` and `requirements.txt` steps in Tanishq Abraham’s tutorial, I deployed a Dog and Cat Classifier on HuggingFace Spaces.
- If you run the training for long enough (a high number of epochs) the error rate will get worse. We’ll learn why in a future lesson.
- Use fastsetup to setup your local machine with Python and Jupyter.
- They recommend using mamba instead of conda as it is faster.
Notebook Exercise
In the cells below, I’ll run the code provided in the Chapter 2 notebook.
# prepare path and subfolder names
bear_types = 'grizzly', 'black', 'teddy'
path = Path('bears')

# download images of grizzly, black and teddy bears
if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok = True)
        results = search_images_ddg(f'{o} bear')
        download_images(dest, urls = results)

# view file paths
fns = get_image_files(path)
fns
(#570) [Path('bears/grizzly/ca9c20c9-e7f4-4383-b063-d00f5b3995b2.jpg'),Path('bears/grizzly/226bc60a-8e2e-4a18-8680-6b79989a8100.jpg'),Path('bears/grizzly/2e68f914-0924-42ed-9e2e-19963fa03a37.jpg'),Path('bears/grizzly/38e2d057-3eb2-4e8e-8e8c-fa409052aaad.jpg'),Path('bears/grizzly/6abc4bc4-2e88-4e28-8ce4-d2cbdb05d7b5.jpg'),Path('bears/grizzly/3c44bb93-2ac5-40a3-a023-ce85d2286846.jpg'),Path('bears/grizzly/2c7b3f99-4c8e-4feb-9342-dacdccf60509.jpg'),Path('bears/grizzly/a59f16a6-fa06-42d5-9d79-b84e130aa4e3.jpg'),Path('bears/grizzly/d1be6dc8-da42-4bee-ac31-0976b175f1e3.jpg'),Path('bears/grizzly/7bc0d3bd-a8dd-477a-aa16-449124a1afb5.jpg')...]
# get list of corrupted images
failed = verify_images(fns)
failed
(#24) [Path('bears/grizzly/2e68f914-0924-42ed-9e2e-19963fa03a37.jpg'),Path('bears/grizzly/f77cfeb5-bfd2-4c39-ba36-621f117a65f6.jpg'),Path('bears/grizzly/37aa7eed-5a83-489d-b8f5-54020ba41390.jpg'),Path('bears/black/90a464ad-b0a7-4cf5-86ff-72d507857007.jpg'),Path('bears/black/f03a0ceb-4983-4b8f-a001-84a0875704e8.jpg'),Path('bears/black/6193c1cf-fda4-43f9-844e-7ba7efd33044.jpg'),Path('bears/teddy/474bdbb3-de2f-49e5-8c5b-62b4f3f50548.JPG'),Path('bears/teddy/58755f3f-227f-4fad-badc-a7d644e54296.JPG'),Path('bears/teddy/eb55dc00-3d01-4385-a7da-d81ac5211696.jpg'),Path('bears/teddy/97eadc96-dc4e-4b3f-8486-88352a3b2270.jpg')...]
# remove corrupted image files
failed.map(Path.unlink)
(#24) [None,None,None,None,None,None,None,None,None,None...]
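`failed.map(Path.unlink)` just calls `Path.unlink` on each corrupted file; the `None`s are `unlink`’s return values. The same pattern in plain Python, with hypothetical temporary files standing in for the corrupted images:

```python
import tempfile
from pathlib import Path

# a temporary folder with a few stand-in "image" files
tmp = Path(tempfile.mkdtemp())
for name in ['ok.jpg', 'corrupt1.jpg', 'corrupt2.jpg']:
    (tmp/name).touch()

# pretend verification flagged two of them as corrupted
failed = [tmp/'corrupt1.jpg', tmp/'corrupt2.jpg']

# delete each flagged file; Path.unlink returns None, hence the Nones above
results = [p.unlink() for p in failed]

print(results)                           # [None, None]
print([p.name for p in tmp.iterdir()])   # ['ok.jpg']
```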
# create DataBlock for training
bears = DataBlock(
blocks = (ImageBlock, CategoryBlock),
get_items = get_image_files,
splitter = RandomSplitter(valid_pct = 0.2, seed = 42),
get_y = parent_label,
item_tfms = Resize(128)
)

# create DataLoaders object
dls = bears.dataloaders(path)

# view training batch -- looks good!
dls.show_batch(max_n = 4, nrows = 1)
# view validation batch -- looks good!
dls.valid.show_batch(max_n = 4, nrows = 1)
# observe how images react to the "squish" ResizeMethod
bears = bears.new(item_tfms = Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
Notice how the grizzlies in the third image look abnormally skinny, since the image is squished.
# observe how images react to the "pad" ResizeMethod
bears = bears.new(item_tfms = Resize(128, ResizeMethod.Pad, pad_mode = 'zeros'))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
In these images, the original aspect ratio is maintained.
# observe how images react to the transform RandomResizedCrop
bears = bears.new(item_tfms = RandomResizedCrop(128, min_scale = 0.3))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
# observe how images react to data augmentation transforms
bears = bears.new(item_tfms=Resize(128), batch_tfms = aug_transforms(mult = 2))
dls = bears.dataloaders(path)
# note that data augmentation occurs on training set
dls.train.show_batch(max_n = 8, nrows = 2, unique = True)
# train the model in order to clean the data
bears = bears.new(
item_tfms = RandomResizedCrop(224, min_scale = 0.5),
batch_tfms = aug_transforms())
dls = bears.dataloaders(path)
dls.show_batch()
# train the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 100MB/s]
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.221027 | 0.206999 | 0.055046 | 00:34 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.225023 | 0.177274 | 0.036697 | 00:32 |
| 1 | 0.162711 | 0.189059 | 0.036697 | 00:31 |
| 2 | 0.144491 | 0.191644 | 0.027523 | 00:31 |
| 3 | 0.122036 | 0.188296 | 0.018349 | 00:31 |
# view Confusion Matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
The model confused a grizzly for a black bear and a black bear for a grizzly bear. It didn’t confuse any of the teddy bears, which makes sense given how different they look to real bears.
# view images with the highest losses
interp.plot_top_losses(5, nrows = 1)
The fourth image has two humans in it, which is likely why the model didn’t recognize the bear. The model correctly predicted the third and fifth images, but with low confidence (57% and 69%).
# clean the training and validation sets
from fastai.vision.widgets import *
cleaner = ImageClassifierCleaner(learn)
cleaner
I cleaned up the images (deleting an image of a cat, another of a cartoon bear, a dog, and a blank image).
# delete or move images based on the dropdown selections made in the cleaner
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

# create new dataloaders object
bears = bears.new(
item_tfms = RandomResizedCrop(224, min_scale = 0.5),
batch_tfms = aug_transforms())
dls = bears.dataloaders(path)
dls.show_batch()
# retrain the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)

| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.289331 | 0.243501 | 0.074074 | 00:32 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.225567 | 0.256021 | 0.064815 | 00:32 |
| 1 | 0.218850 | 0.288018 | 0.055556 | 00:34 |
| 2 | 0.184954 | 0.315183 | 0.055556 | 00:31 |
| 3 | 0.141363 | 0.308634 | 0.055556 | 00:31 |
Weird!! After cleaning the data, the model got worse (1.8% error rate is now 5.6%). I’ll run the cleaning routine again and retrain the model to see if it makes a difference. Perhaps there are still erroneous images in the mix.
# view Confusion Matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
This time, the model incorrectly predicted 3 grizzlies as black bears, 2 black bears as grizzlies and 1 black bear as a teddy.
cleaner = ImageClassifierCleaner(learn)
cleaner

# delete or move images based on the dropdown selections made in the cleaner
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

# create new dataloaders object
bears = bears.new(
item_tfms = RandomResizedCrop(224, min_scale = 0.5),
batch_tfms = aug_transforms())
dls = bears.dataloaders(path)
# The lower right image (cartoon bear) is one that I selected "Delete" for
# in the cleaner so I'm not sure why it's still there
# I'm wondering if there's something wrong with the cleaner or how I'm using it?
dls.show_batch()
# retrain the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)

| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.270627 | 0.130137 | 0.046729 | 00:31 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.183445 | 0.078030 | 0.028037 | 00:32 |
| 1 | 0.201080 | 0.053461 | 0.018692 | 00:33 |
| 2 | 0.183515 | 0.019479 | 0.009346 | 00:37 |
| 3 | 0.144900 | 0.012682 | 0.000000 | 00:31 |
I’m still not confident that this is a 100% accurate model given the bad images in the training set (such as the cartoon bear) but I’m going to go with it for now.
Book Notes
Chapter 2: From Model to Production
- Underestimating the constraints and overestimating the capabilities of deep learning may lead to frustratingly poor results, at least until you gain some experience and can solve the problems that arise.
- Overestimating the constraints and underestimating the capabilities of deep learning may mean you do not attempt a solvable problem because you talk yourself out of it.
- The most important thing (as you learn deep learning) is to ensure that you have a project to work on.
- The goal is not to find the “perfect” dataset or project, but just to get started and iterate from there.
- Complete every step as well as you can in a reasonable amount of time, all the way to the end.
- Computer vision
- Object recognition: recognize items in an image
- Object detection: recognition + highlight the location and name of each found object.
- Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style from those used to train the model.
- NLP
- Deep learning is not good at generating correct responses.
- Text generation models will always be technologically a bit ahead of models for recognizing automatically generated text.
- Google’s online translation system is based on deep learning.
- Combining text and images
- A deep learning model can be trained on input images with output captions written in English, and can learn to generate surprisingly appropriate captions automatically for new images (with no guarantee the captions will be correct).
- Deep learning should be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely.
- Tabular data
- If you already have a system that is using random forests or gradient boosting machines then switching to or adding deep learning may not result in any dramatic improvement.
- Deep learning greatly increases the variety of columns that you can include.
- Deep learning models generally take longer to train than random forests or gradient boosting machines.
- Recommendation systems
- A special type of tabular data (a high-cardinality categorical variable representing users and another one representing products or something similar).
- Deep learning models are good at handling high cardinality categorical variables and thus recommendation systems.
- Deep learning models do well when combining these variables with other kinds of data such as natural language, images, or additional metadata represented as tables such as user information, previous transactions, and so forth.
- Nearly all machine learning approaches have the downside that they tell you only which products a particular user might like, rather than what recommendations would be helpful for a user.
- Other data types
- Using NLP deep learning methods is the current SOTA approach for many types of protein analysis since protein chains look a lot like natural language documents.
- The Drivetrain Approach
- Defined objective
- Levers (what inputs can we control)
- Data (what inputs we can collect)
- Models (how the levers influence the objective)
- Gathering data
- For most projects you can find the data online.
- Use `duckduckgo_search`
- From Data to DataLoaders
- `DataLoaders` is a thin class that just stores whatever `DataLoader` objects you pass to it and makes them available as `train` and `valid`.
- To turn data into a `DataLoaders` object we need to tell fastai four things:
- What kinds of data we are working with.
- How to get the list of items.
- How to label these items.
- How to create the validation set.
- With the `DataBlock` API you can customize every stage of the creation of your `DataLoaders`:
bears = DataBlock(
blocks=(ImageBlock, CategoryBlock),
get_items=get_image_files,
splitter=RandomSplitter(valid_pct=0.2, seed=42),
get_y=parent_label,
  item_tfms=Resize(128))
- Explanation of `DataBlock`: `blocks` specifies types for the independent (the thing we are using to make predictions from) and dependent (our target) variables.
- Computers don’t really know how to create random numbers at all, but simply create lists of numbers that look random; if you provide the same starting point for that list each time (called the seed), then you will get the exact same list each time.
- Images need to be all the same size.
- A `DataLoader` is a class that provides batches of a few items at a time to the GPU.
- fastai’s default batch size is 64 items.
- `Resize` crops the images to fit a square shape; alternatively you can pad (`ResizeMethod.Pad`) or squish (`ResizeMethod.Squish`) the images to fit the square.
- Squishing (the model learns that things look different from how they actually are), cropping (removal of features that would allow us to perform recognition) and padding (lots of empty space which is just wasted computation) are all wasteful or problematic approaches. Instead, randomly select part of the image and crop to just that part. On each epoch, we randomly select a different part of each image (`RandomResizedCrop(min_scale)`).
- Training the neural network with examples of images in which objects are in slightly different places and are slightly different sizes helps it to understand the basic concept of what an object is and how it can be represented in an image.
- Data Augmentation
- refers to creating random variations of our input data, such that they appear different but do not change the meaning of the data (rotation, flipping, perspective warping, brightness changes, and contrast changes).
- `aug_transforms()` provides a standard set of augmentations.
- Use `batch_tfms` to process a batch at a time on the GPU to save time.
- Training your model and using it to clean your data
- View confusion matrix with `ClassificationInterpretation.from_learner(learn)`. The diagonal shows images that are classified correctly. Calculated using the validation set.
- Sort images by loss using `interp.plot_top_losses()`.
- Loss is high if the model is incorrect (especially if it’s also confident) or if it’s correct but not confident.
- A model can help you find data issues more quickly.
- Using the model for inference
- `learn.export()` will export a .pkl file.
- Get predictions with `learn_inf.predict(<input>)`. This returns three things: the predicted category in the same format you originally provided, the index of the predicted category, and the probabilities for each category.
- You can access the `DataLoaders` as an attribute of the `Learner`: `learn_inf.dls`.
- Deploying your app
- You almost certainly do not need a GPU to serve your model in production.
- If you need to classify many users’ images at a time, you have a high-volume scenario; in that case, consider Microsoft’s ONNX Runtime or AWS SageMaker.
- Recommended wherever possible to deploy the model itself to a server and have your mobile/edge application connect to it as a web service.
- If your application uses sensitive data, your users may be concerned about an approach that sends that data to a remote server.
- How to Avoid Disaster
- Understanding and testing the behavior of a deep learning model is much more difficult than with most other code you write.
- The kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matter, which isn’t the kind of input this system is going to be getting in real life. We may need to do a lot of our own data collection and labeling to create a useful system.
- out-of-domain data: data that our model sees in production that is very different from what it saw during training.
- domain shift: data that our model sees changes over time.
- Deployment process
- Manual Process: run model in parallel, humans check all predictions.
- Limited scope deployment: careful human supervision, time or geography limited.
- Gradual expansion: good reporting systems needed, consider what could go wrong.
- Unforeseen consequences and feedback loops
- Your model may change the behavior of the system it’s a part of.
- feedback loops can result in negative implications of bias getting worse.
- A helpful exercise prior to rolling out a significant machine learning system is to consider the question “What would happen if it went really, really well?”
- Questionnaire
- Where do text models currently have a major deficiency?
- Providing correct or accurate information.
- What are possible negative societal implications of text generation models?
- The viral spread of misinformation, which can lead to real actions and harms.
- In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?
- Run the model in parallel with a human checking its predictions.
- What kind of tabular data is deep learning particularly good at?
- High-cardinality categorical data.
- What’s a key downside of directly using a deep learning model for recommendation systems?
- It will only tell you which products a particular user might like, rather than what recommendations may be helpful for a user.
- What are the steps of the Drivetrain Approach?
- Define an objective
- Determine what inputs (levers) you can control
- Collect data
- Create models (how the levers influence the objective)
- How do the steps of the Drivetrain Approach map to a recommendation system?
- Objective: drive additional sales due to recommendations.
- Lever: ranking of the recommendations.
- Data: must be collected to generate recommendations that will cause new sales.
- Models: two models of purchase probability, conditional on seeing or not seeing a recommendation; the difference between these two probabilities is a utility function for a given recommendation to a customer (low when the algorithm recommends a familiar book that the customer has already rejected, or a book they would have bought even without the recommendation).
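The two-model utility can be made concrete with hypothetical purchase probabilities:

```python
def recommendation_utility(p_buy_if_shown, p_buy_if_not_shown):
    """Expected lift from showing the recommendation: the difference
    between purchase probability with and without it."""
    return p_buy_if_shown - p_buy_if_not_shown

# a familiar book the customer would likely buy anyway: near-zero utility
print(recommendation_utility(0.90, 0.88))
# a book they would rarely discover on their own: high utility
print(recommendation_utility(0.40, 0.05))
```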
- Create an image recognition model using data you curate, and deploy it on the web.
- Here.
- What is `DataLoaders`?
- A class that creates validation and training sets/batches that are fed to the GPU.
- What four things do we need to tell fastai to create `DataLoaders`?
- What kinds of data we are working with (independent and dependent variables).
- How to get the list of items.
- How to label these items.
- How to create the validation set.
- What does the `splitter` parameter to `DataBlock` do?
- Set aside a percentage of the data as the validation set.
- How do we ensure a random split always gives the same validation set?
- Set the `seed` parameter to the same value.
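The effect of the seed can be shown with plain Python’s `random` module (a stand-in for the random number generator behind fastai’s `RandomSplitter`; the file names are made up):

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out valid_pct as validation."""
    items = list(items)
    random.Random(seed).shuffle(items)   # same seed -> same shuffle
    cut = int(len(items) * valid_pct)
    return items[cut:], items[:cut]      # train, valid

files = [f'img_{i}.jpg' for i in range(10)]
train1, valid1 = random_split(files, seed=42)
train2, valid2 = random_split(files, seed=42)
print(valid1 == valid2)   # True: identical validation set on every run
```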
- What letters are often used to signify the independent and dependent variables?
- Independent: x
- Dependent: y
- What’s the difference between crop, pad and squish resize approaches? When might you choose one over the others?
- Crop: takes a section of the image and resizes it to the desired size. Use when it’s not necessary to have the model train on the whole image.
- Pad: keep the image aspect ratio as is, add white/black padding to make a square. Use when it’s necessary to have the model train on the whole image.
- Squish: distorts the image to fit a square. Use when it’s not necessary to have the model train on the original aspect ratio.
- What is data augmentation? Why is it needed?
- Data augmentation is the creation of random variations of input data through techniques like rotation, flipping, brightness changes, contrast changes, perspective warping. It is needed to help the model learn to recognize objects under different lighting/perspective conditions.
- Provide an example of where the bear classification model might work poorly in production, due to structural or style differences in the training data.
- What is the difference between `item_tfms` and `batch_tfms`?
- `item_tfms` are transforms applied to each item in the set; `batch_tfms` are transforms applied to a batch of items at a time.
- What is a confusion matrix?
- A matrix that shows the counts of predicted (columns) vs. actual (rows) labels, with the diagonal being correctly predicted data.
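A confusion matrix is simple to compute by hand. A minimal sketch with made-up bear predictions (rows = actual, columns = predicted):

```python
from collections import Counter

def confusion_matrix(actuals, preds, labels):
    """Count (actual, predicted) pairs; diagonal cells are correct."""
    counts = Counter(zip(actuals, preds))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels  = ['black', 'grizzly', 'teddy']
actuals = ['black', 'black', 'grizzly', 'grizzly', 'teddy', 'teddy']
preds   = ['black', 'grizzly', 'grizzly', 'black', 'teddy', 'teddy']

for label, row in zip(labels, confusion_matrix(actuals, preds, labels)):
    print(label, row)
```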
- What does `export` save?
- Both the architecture and the parameters, as a `.pkl` file.
- What is called when we use a model for making predictions, instead of training?
- Inference
- What are IPython widgets?
- interactive browser controls for Jupyter Notebooks.
- When would you use a CPU for deployment? When might a GPU be better?
- CPU: low-volume, single-user inputs for prediction.
- GPU: high-volume, multiple-user inputs for predictions.
- What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC?
- Requires internet connectivity (and latency).
- Sensitive data transfer may not be okay with your users.
- Managing complexity and scaling the server creates additional overhead.
- What are three examples of problems that could occur when rolling out a bear warning system in practice?
- out-of-domain data: the images captured of real bears may not be represented in the model’s training or validation datasets.
- Number of bear alerts doubles or halves after rollout of the new system in some location.
- out-of-domain data: the cameras may capture low-resolution images of the bears when the training and validation set had high resolution images.
- What is out-of-domain data?
- Data your model sees in production that it hasn’t seen during training.
- What is domain shift?
- Changes in the data that our model sees in production over time.
- What are the three steps in the deployment process?
- Manual Process
- Limited scope deployment
- Gradual expansion
- Further Research
- Consider how the Drivetrain Approach maps to a project or problem you’re interested in.
- I’ll take the example of a project I will be working on to practice what I’m learning in this book: training a deep learning model which correctly classifies the typeface from a collection of single letters.
- The objective: correctly classify typeface from a collection of single letters.
- Levers: observe key features of key letters that are the “tell” of a typeface.
- Data: using an HTML canvas object and Adobe Fonts, generate images of single letters of multiple fonts associated with each category of typeface.
- Models: output the probabilities of each typeface a given collection of single letters is predicted as. This allows for some flexibility in how you categorize letters based on the shared characteristics of more than one typeface that the particular font may possess.
- When might it be best to avoid certain types of data augmentation?
- In my typeface example, it’s best to avoid perspective warping because it will change key features used to recognize a typeface.
- For a project you’re interested in applying deep learning to, consider the thought experiment, “What would happen if it went really, really well?”
- If my typeface classifier works really well, I imagine it would be used by people to take pictures of real-world text and learn what typeface it is. This may inspire a new wave of typeface designers. If a feedback loop was possible, and the classifier went viral, the very definition of typefaces may be affected by popular opinion. Taken a step further, a generative model may be inspired by this classifier, and a new wave of AI typeface would be launched—however this last piece is highly undesirable unless the training of the model involves appropriate licensing and attribution of the typefaces used that are created by humans. Furthermore, from what I understand from reading about typefaces, the process of creating a typeface is an amazing experience and should not be replaced with AI generators. If I created such a generative model (in part 2 of the course) and it went viral (do HuggingFace Spaces go viral? Cuz that’s where I would launch it), I would take it down.
- Start a blog (done!)
Lesson 3: Neural Net Foundations
Video Notes
- How to do a fast.ai lesson
- Watch lecture
- Run notebook & experiment
- Reproduce results
- Repeat with different dataset
- fastbook repo contains “clean” folder with notebooks without markdown text.
- Two concepts: training the model and using it for inference.
- Over 500 architectures in `timm` (PyTorch Image Models). `timm.list_models(pattern)` will list models matching the pattern.
- Pass the string name of a timm model to the `Learner` like: `vision_learner(dls, 'timm model string', ...)`. `in22` = ImageNet with 22k categories, `1k` = ImageNet with 1k categories.
- `learn.predict` probabilities are in the order of `learn.dls.vocab`.
- `learn.model` contains the trained model, which contains lots of nested layers. `learn.model.get_submodule` takes a dotted string navigating through the hierarchy.
- Machine learning models fit functions to data.
- Things between dollar signs are LaTeX: `$...$`.
- General form of a quadratic: `def quad(a,b,c,x): return a*x**2 + b*x + c`
- `partial` from `functools` fixes parameters to a function.
- Loss functions tell us how good our model is.
- `@interact` from `ipywidgets` allows sliders tied to the function it’s above.
- Mean Squared Error: `def mse(preds, acts): return ((preds - acts)**2).mean()`
- For each parameter we need to know: does the loss get better when we increase or decrease the parameter?
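The quadratic, `partial`, and `mse` pieces above compose naturally. A plain-Python check (lists stand in for the PyTorch tensors used in the lesson, and the data values are made up):

```python
from functools import partial

def quad(a, b, c, x): return a*x**2 + b*x + c

# partial fixes a, b and c, leaving a function of x alone
f = partial(quad, 3, 2, 1)   # f(x) = 3x^2 + 2x + 1
print(f(2.0))                # 3*4 + 2*2 + 1 = 17.0

# list-based stand-in for the tensor version of mse
def mse(preds, acts):
    return sum((p - a)**2 for p, a in zip(preds, acts)) / len(preds)

xs    = [1.0, 2.0, 3.0]
acts  = [6.5, 17.0, 34.5]    # slightly noisy targets
preds = [f(x) for x in xs]   # [6.0, 17.0, 34.0]
print(mse(preds, acts))      # mean of (0.5^2, 0, 0.5^2)
```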
- The derivative is the function that tells you: if you increase the input does the output increase or decrease, and by how much?
- `*params` spreads out the list into its elements and passes each to the function.
- 1-D (rank 1) tensors are lists of numbers, 2-D tensors are tables of numbers, 3-D tensors are layers of tables of numbers, and so on.
- `tensor.requires_grad_()` tells PyTorch to calculate gradients for the values in the tensor whenever it’s used in a calculation.
- `loss.backward()` calculates gradients on the inputs to the loss function.
- The `abc.grad` attribute is added after gradients are calculated.
- A negative gradient means increasing the parameter will decrease the loss.
- Update parameters with torch.no_grad() so PyTorch doesn’t calculate gradients for the update step. We don’t want the derivative of the parameter update; we only want the derivative with respect to the loss.
- Automate the steps:
- Calculate Mean Squared Error
- Call .backward().
- Subtract gradient * small number from the parameters.
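These three steps can be sketched in plain Python on a hypothetical one-parameter model y = a*x (the toy data and the analytic gradient here are my own illustration, not from the lesson):

```python
# toy data generated from y = 2x, so the optimal parameter is a = 2
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

a = 0.0    # arbitrary starting parameter
lr = 0.05  # the "small number" multiplied by the gradient

for step in range(100):
    preds = [a * x for x in xs]
    # step 1: calculate Mean Squared Error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # step 2: the gradient of the loss w.r.t. a (what .backward computes for us)
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # step 3: subtract gradient * small number from the parameter
    a -= lr * grad

print(round(a, 3))  # close to 2.0
```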
- All optimizers are built on the concept of gradient descent (calculate gradients and decrease the loss).
- We need a better function than quadratics
- Rectified Linear Unit:
def rectified_linear(m,b,x):
y = m*x + b
return torch.clip(y, 0.)
- torch.clip replaces values less than the specified minimum with that minimum (in this case, it turns negative values into 0).
- Adding rectified linear functions together gives us an arbitrarily squiggly function that will match the data as closely as we want.
- ReLU in 2D gives you surfaces, volumes in 3D, etc.
- With this incredibly simple foundation you can construct an arbitrarily precise, accurate model.
- When you have ReLU’s getting added together, and gradient descent to optimize the parameters, and samples of inputs and outputs that you want, the computer “draws the owl” so to speak.
- Deep learning is using gradient descent to set some parameters to make a wiggly function (the addition of lots of rectified linear units or something very similar to that) that matches your data.
- When selecting an architecture, the biggest beginner mistake is that they jump to the highest-accuracy models.
- At the start of the project, just use resnet18 so you can spend all of your time trying things out (data augmentation, data cleaning, different external data) as fast as possible.
- Trying better architectures is the very last thing to do.
- How do I know if I have enough data?
- Vast majority of projects in industry wait far too long until they train their first model.
- Train your first model on day 1 with whatever CSV files you can hack together.
- Semi-supervised training lets you get dramatically more out of your data.
- Often it’s easy to get lots of inputs but hard to get lots of outputs (labels).
- Units of parameter gradients: for each increase in parameter of 1, the gradient is the amount the loss would change by (if it stayed at that slope—which it doesn’t because it’s a curve).
- Once you get close enough to the optimal parameter value, all loss functions look like quadratics
- The slope of the loss function decreases as you approach the optimal
- Learning rate (a hyperparameter) is multiplied by the gradient, the product of which is subtracted from the parameters
- If you pick a learning rate that’s too large, you will diverge; if you pick too small, it’ll take too long to train.
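The divergence can be seen in a toy one-parameter example (my own illustration, fitting y = a*x to data from y = 2x with an analytic mean-squared-error gradient):

```python
def train(lr, steps=50):
    # gradient descent on a one-parameter linear fit
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    a = 0.0
    for _ in range(steps):
        grad = sum(2 * (a * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        a -= lr * grad
    return a

print(train(0.05))  # small enough: converges near the optimum a = 2
print(train(0.25))  # too large: each step overshoots further, so a diverges
```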
- http://matrixmultiplication.xyz/
- Matrix multiplication is the critical foundational mathematical operation in deep learning
- GPUs are good at matrix multiplication with tensor cores (multiply together two 4x4 matrices)
- Use a spreadsheet to train a deep learning model on the Kaggle Titanic dataset in which you’re trying to predict if a person survived.
- Columns included (convert some of them to binary categorical variables):
- Survivor
- Pclass
- Convert to Pclass_1 and Pclass_2 (both 1/0).
- Sex
- Convert to Male (0/1) column.
- Age
- Remove blanks.
- Normalize (Age/Max(Age))
- SibSp (how many siblings they have)
- Parch (# of parents/children aboard)
- Fare
- Lots of very small and very large fares; the log has a much more even distribution: LOG10(Fare + 1).
- Embarked (which city they got on at)
- Remove blanks.
- Convert to Embark_S and Embark_C (both 1/0)
- Ones
- Add a column of 1s.
- Create random numbers for params (including Const) with =RAND() - 0.5.
- Regression
- Use SUMPRODUCT to calculate the linear function.
- Loss of the linear function is (linear function result - Survived) ^ 2.
- Average loss = AVERAGE(individual losses).
- Use “Solver” with the GRG Nonlinear Solving Method. Set Objective to minimize the cell with the average loss by changing the parameter variables.
- Neural Net
- Two sets of params.
- Two linear columns.
- Two ReLU columns.
- Adding two linear functions together gives you a linear function, we want all those wiggles (non-linearity) so we use ReLUs.
- ReLU: IF(lin1 < 0, 0, lin1)
- Preds = sum of the two ReLUs.
- Loss same as regression.
- Solver process the same as well.
- Neural Net (Matrix Multiplication)
- Transpose params into two columns.
- =MMULT(...) for the Lin1 and Lin2 columns.
- Keep the ReLU, Preds and Loss columns the same.
- Optimize params using Solver.
- Helpful reminder to build intuition around matrix multiplication: it’s doing the same thing as the SUMPRODUCTs.
- Dummy variables: Pclass_1, Pclass_2, etc.
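The spreadsheet’s neural net (matrix multiplication version) can be sketched in plain Python. The passenger feature values below are made up for illustration, and mmult mirrors what Excel’s =MMULT does:

```python
import random

random.seed(0)

# hypothetical mini-batch: 4 passengers x 4 columns
# (normalized Age, Male, Embark_S, and a column of ones for the constant)
X = [[0.27, 1, 0, 1.0],
     [0.48, 0, 1, 1.0],
     [0.33, 1, 1, 1.0],
     [0.90, 0, 0, 1.0]]
survived = [0, 1, 1, 0]

# two sets of params, one per linear column, initialized like =RAND() - 0.5
W = [[random.random() - 0.5 for _ in range(2)] for _ in range(4)]

def mmult(A, B):
    # rows of A dotted with columns of B, as in Excel's =MMULT(...)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

lin = mmult(X, W)                                    # Lin1 and Lin2 columns at once
relu = [[max(0.0, v) for v in row] for row in lin]   # IF(lin < 0, 0, lin)
preds = [r1 + r2 for r1, r2 in relu]                 # Preds = sum of the two ReLUs
loss = sum((p - s) ** 2 for p, s in zip(preds, survived)) / len(preds)
print(loss)
```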
- Next lesson: NLP
- It’s about making predictions with text data which most of the time is in the form of prose.
- First Farsi NLP resource was created by a student of the first fastai course.
- NLP most commonly and practically used for classification.
- Document = one or two words, a book, a Wikipedia page; any length.
- Classification = figure out a category for a document.
- Sentiment analysis
- Author identification
- Legal discovery (is this document in-scope or out-of-scope)
- Organizing documents by topic
- Triaging inbound emails
- Classification of text looks similar to images.
- We’re going to use a different library: HuggingFace Transformers
- Helpful to see how things are done in more than one library.
- HuggingFace Transformers doesn’t have the same high-level API, so you have to do more stuff manually, which is good for students at this point of the course.
- It’s a good library.
- Before the next lesson take a look at the NLP notebook and U.S. Patent to Phrase Matching data.
- Trying to figure out in patents whether two concepts are referring to the same thing. The document is text1, text2, and the category is similar (1) or not-similar (0).
- Will also talk about the two very important topics of validation sets and metrics.
Notebook Exercise
Training and Deploying: Pets Classifier
In this section, I’ll train a Pets dataset classifier as done by Jeremy in this notebook.
from fastai.vision.all import *
import timm

path = untar_data(URLs.PETS)/'images'
# Create DataLoaders object
dls = ImageDataLoaders.from_name_func('.',
get_image_files(path),
valid_pct=0.2,
seed=42,
label_func=RegexLabeller(pat = r'^([^/]+)_\d+'),
item_tfms=Resize(224))
dls.show_batch(max_n=4)
# train using resnet34 as architecture
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|██████████| 83.3M/83.3M [00:00<00:00, 196MB/s]
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.496086 | 0.316146 | 0.100135 | 01:12 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.441153 | 0.315289 | 0.093369 | 01:04 |
| 1 | 0.289844 | 0.215224 | 0.069012 | 01:05 |
| 2 | 0.123374 | 0.191152 | 0.060217 | 01:03 |
The pets classifier, using resnet34 and 3 epochs, is about 94% accurate.
# train using a timm architecture
# from the convnext family of architectures
learn = vision_learner(dls, 'convnext_tiny_in22k', metrics=error_rate).to_fp16()
learn.fine_tune(3)
/usr/local/lib/python3.10/dist-packages/timm/models/_factory.py:114: UserWarning: Mapping deprecated model name convnext_tiny_in22k to current convnext_tiny.fb_in22k.
model = create_fn(
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.130913 | 0.240275 | 0.085927 | 01:06 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.277886 | 0.193888 | 0.061570 | 01:08 |
| 1 | 0.196232 | 0.174544 | 0.055480 | 01:09 |
| 2 | 0.127525 | 0.156720 | 0.048038 | 01:07 |
Using convnext_tiny_in22k, the model is about 95.2% accurate (an error rate of 0.048 versus resnet34’s 0.060, roughly a 20% relative decrease in error rate).
# export to use in gradio app
learn.export('pets_model.pkl')
You can view my pets classifier gradio app here.
Which image models are best?
In this section, I’ll plot the timm model results as shown in Jeremy’s notebook.
import pandas as pd

# load data
df_results = pd.read_csv("../../../fastai-course/data/results-imagenet.csv")
df_results.head()

|   | model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k | 90.052 | 9.948 | 99.048 | 0.952 | 305.08 | 448 | 1.0 | bicubic |
| 1 | eva02_large_patch14_448.mim_in22k_ft_in22k_in1k | 89.966 | 10.034 | 99.012 | 0.988 | 305.08 | 448 | 1.0 | bicubic |
| 2 | eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.786 | 10.214 | 98.992 | 1.008 | 1,014.45 | 560 | 1.0 | bicubic |
| 3 | eva02_large_patch14_448.mim_in22k_ft_in1k | 89.624 | 10.376 | 98.950 | 1.050 | 305.08 | 448 | 1.0 | bicubic |
| 4 | eva02_large_patch14_448.mim_m38m_ft_in1k | 89.570 | 10.430 | 98.922 | 1.078 | 305.08 | 448 | 1.0 | bicubic |
top1 = what percent of the time the model predicts the correct label with the highest probability.
top5 = what percent of the time the model predicts the correct label within the top 5 highest probabilities.
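A minimal sketch of how the two metrics are scored for a single prediction (the probabilities below are made up):

```python
# hypothetical predicted probabilities for 6 classes, and the true class index
probs = [0.05, 0.20, 0.10, 0.40, 0.15, 0.10]
true_class = 1

# rank class indices from highest to lowest probability
ranked = sorted(range(len(probs)), key=lambda i: -probs[i])

top1_correct = ranked[0] == true_class   # only the single best guess counts
top5_correct = true_class in ranked[:5]  # any of the 5 best guesses counts
print(top1_correct, top5_correct)
```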
# remove additional text from model name
df_results['model_org'] = df_results['model']
df_results['model'] = df_results['model'].str.split('.').str[0]
df_results.head()

|   | model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation | model_org |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eva02_large_patch14_448 | 90.052 | 9.948 | 99.048 | 0.952 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k |
| 1 | eva02_large_patch14_448 | 89.966 | 10.034 | 99.012 | 0.988 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_in22k_ft_in22k_in1k |
| 2 | eva_giant_patch14_560 | 89.786 | 10.214 | 98.992 | 1.008 | 1,014.45 | 560 | 1.0 | bicubic | eva_giant_patch14_560.m30m_ft_in22k_in1k |
| 3 | eva02_large_patch14_448 | 89.624 | 10.376 | 98.950 | 1.050 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_in22k_ft_in1k |
| 4 | eva02_large_patch14_448 | 89.570 | 10.430 | 98.922 | 1.078 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_m38m_ft_in1k |
def get_data(part, col):
# get benchmark data and merge with model data
df = pd.read_csv(f'../../../fastai-course/data/benchmark-{part}-amp-nhwc-pt111-cu113-rtx3090.csv').merge(df_results, on='model')
# convert samples/sec to sec/sample
df['secs'] = 1. / df[col]
# pull out the family name from the model name
df['family'] = df.model.str.extract('^([a-z]+?(?:v2)?)(?:\d|_|$)')
# removing `resnetv2_50d_gn` and `resnet50_gn` for some reason
df = df[~df.model.str.endswith('gn')]
# not sure why the following line is here, "in22" was removed in cell above
df.loc[df.model.str.contains('in22'),'family'] = df.loc[df.model.str.contains('in22'),'family'] + '_in22'
df.loc[df.model.str.contains('resnet.*d'),'family'] = df.loc[df.model.str.contains('resnet.*d'),'family'] + 'd'
# only returns subset of families
return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]

# load benchmark inference data
df = get_data('infer', 'infer_samples_per_sec')
df.head()

|   | model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | param_count_x | top1 | top1_err | top5 | top5_err | param_count_y | img_size | crop_pct | interpolation | model_org | secs | family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | levit_128s | 21485.80 | 47.648 | 1024 | 224 | 7.78 | 76.526 | 23.474 | 92.872 | 7.128 | 7.78 | 224 | 0.900 | bicubic | levit_128s.fb_dist_in1k | 0.000047 | levit |
| 13 | regnetx_002 | 17821.98 | 57.446 | 1024 | 224 | 2.68 | 68.746 | 31.254 | 88.536 | 11.464 | 2.68 | 224 | 0.875 | bicubic | regnetx_002.pycls_in1k | 0.000056 | regnetx |
| 15 | regnety_002 | 16673.08 | 61.405 | 1024 | 224 | 3.16 | 70.278 | 29.722 | 89.528 | 10.472 | 3.16 | 224 | 0.875 | bicubic | regnety_002.pycls_in1k | 0.000060 | regnety |
| 17 | levit_128 | 14657.83 | 69.849 | 1024 | 224 | 9.21 | 78.490 | 21.510 | 94.012 | 5.988 | 9.21 | 224 | 0.900 | bicubic | levit_128.fb_dist_in1k | 0.000068 | levit |
| 18 | regnetx_004 | 14440.03 | 70.903 | 1024 | 224 | 5.16 | 72.398 | 27.602 | 90.828 | 9.172 | 5.16 | 224 | 0.875 | bicubic | regnetx_004.pycls_in1k | 0.000069 | regnetx |
# plot the data
import plotly.express as px
w,h = 1000, 800
def show_all(df, title, size):
return px.scatter(df,
width=w,
height=h,
size=df[size]**2,
title=title,
x='secs',
y='top1',
log_x=True,
color='family',
hover_name='model_org',
hover_data=[size]
)
show_all(df, 'Inference', 'infer_img_size')

# plot a subset of the data
subs = 'levit|resnetd?|regnetx|vgg|convnext.*|efficientnetv2|beit|swin'
def show_subs(df, title, size, subs):
df_subs = df[df.family.str.fullmatch(subs)]
return px.scatter(df_subs,
width=w,
height=h,
size=df_subs[size]**2,
title=title,
trendline='ols',
trendline_options={'log_x':True},
x='secs',
y='top1',
log_x=True,
color='family',
hover_name='model_org',
hover_data=[size])
show_subs(df, 'Inference', 'infer_img_size', subs)

# plot inference speed vs parameter count
px.scatter(df,
width=w,
height=h,
x='param_count_x',
y='secs',
log_x=True,
log_y=True,
color='infer_img_size',
hover_name='model_org',
hover_data=['infer_samples_per_sec', 'family']
)

# repeat plots for training data
tdf = get_data('train', 'train_samples_per_sec')
show_all(tdf, 'Training', 'train_img_size')

# subset of training data
show_subs(tdf, 'Training', 'train_img_size', subs)

How does a neural net really work?
In this section, I’ll recreate the content in Jeremy’s notebook here, where he walks through a quadratic example of training a function to match the data.
A neural network layer:
- Multiplies each input by a number of values. These values are known as parameters.
- Adds them up for each group of values.
- Replaces the negative numbers with zeros.
# helper functions
from ipywidgets import interact
from fastai.basics import *

# helper functions
plt.rc('figure', dpi=90)
def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
x = torch.linspace(min,max, 100)[:,None]
if ylim: plt.ylim(ylim)
plt.plot(x, f(x), color)
if title is not None: plt.title(title)

In the plot_function definition, I’ll look into why [:,None] is added after torch.linspace(min, max, 100).
torch.linspace(-1, 1, 10), torch.linspace(-1, 1, 10).shape
(tensor([-1.0000, -0.7778, -0.5556, -0.3333, -0.1111, 0.1111, 0.3333, 0.5556,
0.7778, 1.0000]),
torch.Size([10]))
torch.linspace(-1, 1, 10)[:,None], torch.linspace(-1, 1, 10)[:,None].shape
(tensor([[-1.0000],
[-0.7778],
[-0.5556],
[-0.3333],
[-0.1111],
[ 0.1111],
[ 0.3333],
[ 0.5556],
[ 0.7778],
[ 1.0000]]),
torch.Size([10, 1]))
[:, None] adds a dimension to the tensor.
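NumPy follows the same indexing rule, so the effect of None (equivalent to unsqueeze in PyTorch) can be checked quickly:

```python
import numpy as np

v = np.linspace(-1, 1, 5)
print(v.shape)       # a rank-1 array of 5 values

col = v[:, None]     # None inserts a new axis of length 1
print(col.shape)     # now a 5x1 column vector

# the values are unchanged; only the shape differs
print((col[:, 0] == v).all())
```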
Next he fits a quadratic function to data:
def f(x): return 3*x**2 + 2*x + 1
plot_function(f, '$3x^2 + 2x + 1$')
In order to simulate “finding” or “learning” the right model fit, he creates a general quadratic function:
def quad(a, b, c, x): return a*x**2 + b*x + c

and uses partial to make new quadratic functions:
def mk_quad(a, b, c): return partial(quad, a, b, c)

# recreating original quadratic with mk_quad
f2 = mk_quad(3, 2, 1)
plot_function(f2)
f2
functools.partial(<function quad at 0x148c6d000>, 3, 2, 1)
quad
<function __main__.quad(a, b, c, x)>
Next he simulates noisy measurements of the quadratic f:
# `scale` parameter is the standard deviation of the distribution
def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
# noise function matches quadratic x + x^2 (with noise) + constant noise
def add_noise(x, mult, add): return x * (1+noise(x, mult)) + noise(x,add)

np.random.seed(42)
x = torch.linspace(-2, 2, steps=20)[:, None]
y = add_noise(f(x), 0.15, 1.5)

# values match Jeremy's
x[:5], y[:5](tensor([[-2.0000],
[-1.7895],
[-1.5789],
[-1.3684],
[-1.1579]]),
tensor([[11.8690],
[ 6.5433],
[ 5.9396],
[ 2.6304],
[ 1.7947]], dtype=torch.float64))
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x148e16320>

# overlay data with variable quadratic
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
plt.scatter(x, y)
plot_function(mk_quad(a, b, c), ylim=(-3,13))

An important note about changing the sliders: only after changing the b and c values do you realize that a also needs to be changed.
Next, he creates a measure for how well the quadratic fits the data, mean absolute error (distance from each data point to the curve).
def mae(preds, acts): return (torch.abs(preds-acts)).mean()

# update interactive plot
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
f = mk_quad(a,b,c)
plt.scatter(x,y)
loss = mae(f(x), y)
plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")

In a neural network we’ll have tens of millions or more parameters to fit and thousands or millions of data points to fit them to, which we can’t do manually with sliders. We need to automate this process.
If we know the gradient of our mae() function with respect to our parameters, a, b and c, then that means we know how adjusting a parameter will change the function. If, say, a has a negative gradient, then we know increasing a will decrease mae(). So we find the gradient of the parameters with respect to the loss function and adjust our parameters a bit in the opposite direction of the gradient sign.
To do this we need a function that will take the parameters as a single vector:
def quad_mae(params):
f = mk_quad(*params)
return mae(f(x), y)

# testing it out
# should equal 2.4219
quad_mae([1.1, 1.1, 1.1])
tensor(2.4219, dtype=torch.float64)
# pick an arbitrary starting point for our parameters
abc = torch.tensor([1.1, 1.1, 1.1])
# tell pytorch to calculate its gradients
abc.requires_grad_()
# calculate loss
loss = quad_mae(abc)
loss
tensor(2.4219, dtype=torch.float64, grad_fn=<MeanBackward0>)
# calculate gradients
loss.backward()
# view gradients
abc.grad
tensor([-1.3529, -0.0316, -0.5000])
# increase parameters to decrease loss based on gradient sign
with torch.no_grad():
abc -= abc.grad*0.01
loss = quad_mae(abc)
print(f'loss={loss:.2f}')
loss=2.40
The loss has gone down from 2.4219 to 2.40. We’re moving in the right direction.
The small number we multiply gradients by is called the learning rate and is the most important hyper-parameter to set when training a neural network.
# use a loop to do a few more iterations
for i in range(10):
loss = quad_mae(abc)
loss.backward()
with torch.no_grad(): abc -= abc.grad*0.01
print(f'step={i}; loss={loss:.2f}')
step=0; loss=2.40
step=1; loss=2.36
step=2; loss=2.30
step=3; loss=2.21
step=4; loss=2.11
step=5; loss=1.98
step=6; loss=1.85
step=7; loss=1.72
step=8; loss=1.58
step=9; loss=1.46
The loss continues to decrease. Here are our parameters and their gradients at this stage:
abc
tensor([1.9634, 1.1381, 1.4100], requires_grad=True)
abc.grad
tensor([-13.4260, -1.0842, -4.5000])
A neural network can approximate any computable function, given enough parameters, using two key steps:
- Matrix multiplication.
- The function \(max(x,0)\), which simply replaces all negative numbers with zero.
The combination of a linear function and \(max\) is called a rectified linear unit and can be written as:
def rectified_linear(m,b,x):
y = m*x+b
return torch.clip(y, 0.)

plot_function(partial(rectified_linear, 1, 1))
# we can do the same thing using PyTorch
import torch.nn.functional as F
def rectified_linear2(m,b,x): return F.relu(m*x+b)
plot_function(partial(rectified_linear2, 1,1))
Create an interactive ReLU:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
plot_function(partial(rectified_linear, m, b), ylim=(-1,4))

Observe what happens when we add two ReLUs together:
def double_relu(m1,b1,m2,b2,x):
return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)
@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))

Creating a triple ReLU function to fit our data:
def triple_relu(m1,b1,m2,b2,m3,b3,x):
return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x) + rectified_linear(m3,b3,x)
def mk_triple_relu(m1,b1,m2,b2,m3,b3): return partial(triple_relu, m1,b1,m2,b2,m3,b3)
@interact(m1=-1.5, b1=-1.5, m2=0.5, b2=0.5, m3=1.5, b3=1.5)
def plot_triple_relu(m1, b1, m2, b2, m3, b3):
f = mk_triple_relu(m1,b1,m2,b2,m3,b3)
plt.scatter(x,y)
loss = mae(f(x), y)
plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")

This same approach can be extended to functions with 2, 3, or more parameters. Drawing squiggly lines through some points is literally all that deep learning does. The above steps will, given enough time and enough data, create (for example) an owl recognizer if you feed it enough owls and non-owls.
We could do thousands of computations on a GPU instead of the above CPU computation. We can greatly reduce the amount of computation and data needed by using a convolution instead of a matrix multiplication. We could make things much faster if, instead of starting with random parameters, we start with the parameters of someone else’s model that does something similar to what we want (transfer learning).
Gradient Descent with Microsoft Excel
Following the instructions in the fastai course lesson video, I’ve created a Microsoft Excel deep learning model here for the Titanic Kaggle data.
As shown in the course video, I trained three different models: linear regression, a neural net (using SUMPRODUCT) and a neural net (using MMULT). After running Microsoft Excel’s Solver, I got the following final mean losses (different from the video) for each model:
- linear: 0.14422715
- nnet: 0.14385956
- mmult: 0.14385956
The linear model loss in the video was about 0.10 and the neural net loss was about 0.08. So, my models didn’t do as well.
Book Notes
In this section, I’ll take notes while reading Chapter 4 in the fastai textbook.
Pixels: The Foundations of Computer Vision
- We’ll use the MNIST dataset for our experiments, which contains handwritten digits.
- MNIST was collected by the National Institute of Standards and Technology and collated into a machine learning dataset by Yann LeCun, who used it in 1998 in LeNet-5, the first computer system to demonstrate practically useful recognition of handwritten digits.
- We’ve seen that the only consistent trait among every fast.ai student who’s gone on to be a world-class practitioner is that they are all very tenacious.
- In this chapter we’ll create a model that can classify any image as a 3 or a 7.
from fastai.vision.all import *

path = untar_data(URLs.MNIST_SAMPLE)

# ls method added by fastai
# lists the count of items
path.ls()
(#3) [Path('/root/.fastai/data/mnist_sample/labels.csv'),Path('/root/.fastai/data/mnist_sample/train'),Path('/root/.fastai/data/mnist_sample/valid')]
(path/'train').ls()
(#2) [Path('/root/.fastai/data/mnist_sample/train/3'),Path('/root/.fastai/data/mnist_sample/train/7')]
# 3 and 7 are the labels
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes
(#6131) [Path('/root/.fastai/data/mnist_sample/train/3/10.png'),Path('/root/.fastai/data/mnist_sample/train/3/10000.png'),Path('/root/.fastai/data/mnist_sample/train/3/10011.png'),Path('/root/.fastai/data/mnist_sample/train/3/10031.png'),Path('/root/.fastai/data/mnist_sample/train/3/10034.png'),Path('/root/.fastai/data/mnist_sample/train/3/10042.png'),Path('/root/.fastai/data/mnist_sample/train/3/10052.png'),Path('/root/.fastai/data/mnist_sample/train/3/1007.png'),Path('/root/.fastai/data/mnist_sample/train/3/10074.png'),Path('/root/.fastai/data/mnist_sample/train/3/10091.png')...]
# view one of the images
im3_path = threes[1]
im3 = Image.open(im3_path)
im3
# the image is stored as numbers
array(im3)[4:10, 4:10]
array([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 29],
[ 0, 0, 0, 48, 166, 224],
[ 0, 93, 244, 249, 253, 187],
[ 0, 107, 253, 253, 230, 48],
[ 0, 3, 20, 20, 15, 0]], dtype=uint8)
# same thing, but a PyTorch tensor
tensor(im3)[4:10, 4:10]
tensor([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 29],
[ 0, 0, 0, 48, 166, 224],
[ 0, 93, 244, 249, 253, 187],
[ 0, 107, 253, 253, 230, 48],
[ 0, 3, 20, 20, 15, 0]], dtype=torch.uint8)
# use pandas.DataFrame to color code the array
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15, 4:22])
df.style.set_properties(**{'font-size': '6pt'}).background_gradient('Greys')

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 29 | 150 | 195 | 254 | 255 | 254 | 176 | 193 | 150 | 96 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 48 | 166 | 224 | 253 | 253 | 234 | 196 | 253 | 253 | 253 | 253 | 233 | 0 | 0 | 0 |
| 3 | 0 | 93 | 244 | 249 | 253 | 187 | 46 | 10 | 8 | 4 | 10 | 194 | 253 | 253 | 233 | 0 | 0 | 0 |
| 4 | 0 | 107 | 253 | 253 | 230 | 48 | 0 | 0 | 0 | 0 | 0 | 192 | 253 | 253 | 156 | 0 | 0 | 0 |
| 5 | 0 | 3 | 20 | 20 | 15 | 0 | 0 | 0 | 0 | 0 | 43 | 224 | 253 | 245 | 74 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 249 | 253 | 245 | 126 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 101 | 223 | 253 | 248 | 124 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 11 | 166 | 239 | 253 | 253 | 253 | 187 | 30 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 16 | 248 | 250 | 253 | 253 | 253 | 253 | 232 | 213 | 111 | 2 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 98 | 98 | 208 | 253 | 253 | 253 | 253 | 187 | 22 | 0 |
The background white pixels are stored as the number 0, black as the number 255, and shades of grey as values in between. The entire image contains 28 pixels across and 28 pixels down, for a total of 784 pixels.
How might a computer recognize these two digits?
Ideas:
3s and 7s have distinct features. A seven generally has two straight lines at different angles; a three has two sets of curves stacked on each other. The point where the two curves intersect could be a recognizable feature of the digit three. The point where the two straight-ish lines intersect could be a recognizable feature of the digit seven. One source of confusion could be handwritten threes with a straight line at the top, similar to a seven. Another could be a handwritten 3 with a straight-ish ending stroke at the bottom, matching a similar stroke of a 7.
First Try: Pixel Similarity
Idea: find the average pixel value for every pixel of the 3s, then do the same for the 7s. To classify an image, see which of the two ideal digits the image is most similar to.
Baseline: A simple model that you are confident should perform reasonably well. It should be simple to implement and easy to test, so that you can then test each of your improved ideas and make sure they are always better than your baseline. Without starting with a sensible baseline, it is difficult to know whether your super-fancy models are any good.
# list comprehension of all digit images
seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]
len(three_tensors), len(seven_tensors)
(6131, 6265)
# use fastai's show_image to display tensor images
show_image(three_tensors[1]);
For every pixel position, we want to compute the average over all the images of the intensity of that pixel. To do this, combine all the images in this list into a single three-dimensional tensor.
When images are floats, the pixel values are expected to be between 0 and 1.
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
# the length of a tensor's shape is its rank
# rank is the number of axes (dimensions) of a tensor
# shape is the size of each axis of a tensor
len(stacked_threes.shape)
3
# rank of a tensor
stacked_threes.ndim
3
We calculate the mean of all the image tensors by taking the mean along dimension 0 of our stacked, rank-3 tensor. This is the dimension that indexes over all the images.
mean3 = stacked_threes.mean(0)
mean3.shape
torch.Size([28, 28])
show_image(mean3);
This is the ideal number 3 based on the dataset. It’s saturated where all the images agree it should be saturated (much of the background, the intersection of the two curves, and top and bottom curve), but it becomes wispy and blurry where the images disagree.
# do the same for sevens
mean7 = stacked_sevens.mean(0)
show_image(mean7);
How would I calculate how similar a particular image is to each of our ideal digits?
I would take the average of the absolute difference between each pixel’s intensity and the corresponding mean digit pixel intensity. The lower the average difference, the closer the digit is to the ideal digit.
# sample 3
a_3 = stacked_threes[1]
show_image(a_3);
L1 norm = Mean of the absolute value of differences.
Root mean squared error (RMSE) = square root of mean of the square of differences.
# L1 norm
dist_3_abs = (a_3 - mean3).abs().mean()
# RMSE
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
dist_3_abs, dist_3_sqr
(tensor(0.1114), tensor(0.2021))
# L1 norm
dist_7_abs = (a_3 - mean7).abs().mean()
# RMSE
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs, dist_7_sqr
(tensor(0.1586), tensor(0.3021))
For both L1 norm and RMSE, the distance between the 3 and the “ideal” 3 is less than the distance to the ideal 7, so our simple model will give the right prediction in this case.
Both distances are provided in PyTorch:
F.l1_loss(a_3.float(), mean7), F.mse_loss(a_3, mean7).sqrt()
(tensor(0.1586), tensor(0.3021))
MSE = mean squared error.
MSE will penalize bigger mistakes more heavily (and be lenient with small mistakes) than L1 norm.
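A quick made-up comparison shows this: two error sets with the same L1 distance but different MSE.

```python
# many small mistakes vs. one big mistake, with the same total absolute error
small_errors = [0.5, 0.5, 0.5, 0.5]
one_big_error = [0.0, 0.0, 0.0, 2.0]

def l1(errs): return sum(abs(e) for e in errs) / len(errs)
def mse(errs): return sum(e * e for e in errs) / len(errs)

print(l1(small_errors), l1(one_big_error))    # 0.5 0.5 -- L1 treats them the same
print(mse(small_errors), mse(one_big_error))  # 0.25 1.0 -- MSE penalizes the big one
```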
NumPy Arrays and PyTorch Tensors
A NumPy array is a multidimensional table of data with all items of the same type.
jagged array: nested arrays of different sizes.
If the items of the array are all of simple type such as integer or float, NumPy will store them as a compact C data structure in memory.
PyTorch tensors cannot be jagged. They can live on the GPU, and they can calculate their own derivatives.
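A minimal sketch of the derivative-tracking part (assuming PyTorch is installed; .to('cuda') would similarly move a tensor to the GPU if one is available):

```python
import torch

t = torch.tensor([2.0, 3.0], requires_grad=True)  # track operations on this tensor
loss = (t ** 2).sum()   # a computation built from the tensor
loss.backward()         # fill in t.grad with d(loss)/dt
print(t.grad)           # the derivative of x**2 is 2*x, so [4., 6.]
```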
# creating arrays and tensors
data = [[1,2,3], [4,5,6]]
arr = array(data)
tns = tensor(data)
arrarray([[1, 2, 3],
[4, 5, 6]])
tnstensor([[1, 2, 3],
[4, 5, 6]])
# select a row
tns[1]tensor([4, 5, 6])
# select a column
tns[:,1]tensor([2, 5])
# slice
tns[1, 1:3]tensor([5, 6])
# standard operators
tns + 1tensor([[2, 3, 4],
[5, 6, 7]])
# tensor type
tns.type()'torch.LongTensor'
# tensor changes type when needed
(tns * 1.5).type()'torch.FloatTensor'
Computing Metrics Using Broadcasting
metric = a number that is calculated based on the predictions of our model and the correct labels in our dataset in order to tell us how good our model is.
Calculate the metric on the validation set.
valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255
valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255
valid_3_tens.shape, valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))
# measure distance between image and ideal
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
mnist_distance(a_3, mean3)
tensor(0.1114)
# calculate mnist_distance for digit 3 validation images
valid_3_dist = mnist_distance(valid_3_tens, mean3)
valid_3_dist, valid_3_dist.shape(tensor([0.1109, 0.1202, 0.1276, ..., 0.1357, 0.1262, 0.1157]),
torch.Size([1010]))
PyTorch broadcasts mean3 to each of the 1010 valid_3_dist tensors in order to calculate the distance. It doesn’t actually copy mean3 1010 times. It does the whole calculation in C (or CUDA for GPU).
In mean((-1, -2)), the tuple (-1, -2) represents a range of axes. This tells PyTorch that we want to take the mean ranging over the values indexed by the last two axes of the tensor—the horizontal and the vertical dimensions of an image.
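The same broadcasting pattern can be seen standalone with a made-up batch of two tiny 3-pixel "images":

```python
import torch

batch = torch.tensor([[1., 2., 3.],
                      [4., 5., 6.]])  # two tiny "images", 3 pixels each
ideal = torch.tensor([1., 1., 1.])    # one "ideal" image

# ideal is broadcast across both rows of batch; no copies are made in memory
dist = (batch - ideal).abs().mean(-1) # mean over the last axis: one distance per image
print(dist)                           # tensor([1., 4.])
```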
If the distance between the digit in question and the ideal 3 is less than the distance to the ideal 7, then it’s a 3:
def is_3(x): return mnist_distance(x, mean3) < mnist_distance(x, mean7)is_3(a_3), is_3(a_3).float()(tensor(True), tensor(1.))
# full validation set---thanks to broadcasting
is_3(valid_3_tens)tensor([ True, True, True, ..., False, True, True])
# calculate accuracy
accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()
accuracy_3s, accuracy_7s, (accuracy_3s + accuracy_7s) / 2(tensor(0.9168), tensor(0.9854), tensor(0.9511))
We are getting more than 90% accuracy on both 3s and 7s. But they are very different looking digits and we’re classifying only 2 out of 10 digits, so we need to make a better model.
Stochastic Gradient Descent
Arthur Samuel’s description of machine learning
Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.
Our pixel similarity approach doesn’t have any weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. We can’t improve our pixel similarity approach.
We could look at each individual pixel and come up with a set of weights for each, such that the highest weights are associated with those pixels most likely to be black for a particular category. For example, pixels toward the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8. This can be represented as a function and set of weight values for each possible category, for instance, the probability of being the number 8:
def pr_eight(x,w): return (x*w).sum()
X is the image, represented as a vector (with all the rows stacked up end to end into a single long line) and the weights are a vector W. We need some way to update the weights to make them a little bit better. We want to find the specific values for the vector W that cause the result of our function to be high for those images that are 8s and low for those images that are not. Searching for the best vector W is a way to search for the best function for recognizing 8s.
Steps required to turn this function into a machine learning classifier:
- Initialize the weights.
- For each image, use these weights to predict whether it appears to be a 3 or a 7.
- Based on these predictions, calculate how good the model is (its loss).
- Calculate the gradient, which measures for each weight how changing that weight would change the loss.
- Step (that is, change) all the weights based on that calculation.
- Go back to step 2 and repeat the process.
- Iterate until you decide to stop the training process (for instance, because the model is good enough or you don’t want to wait any longer).
Initialize: Initialize parameters to random values.
Loss: We need a function that will return a number that is small if the performance of the model is good (by convention).
Step: Gradients allow us to directly figure out in which direction and by roughly how much to change each weight.
Stop: Keep training until the accuracy of the model starts getting worse, we run out of time, or the number of epochs we decided on is complete.
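The seven steps can be sketched end to end on a throwaway problem; this is a minimal version with made-up synthetic data (the linear model and targets are assumptions for illustration, not the MNIST task):

```python
import torch

x = torch.randn(100)
y = 3 * x + 0.5                          # synthetic targets: slope 3, intercept 0.5

w = torch.randn(1, requires_grad=True)   # 1. initialize the weights
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for _ in range(100):                     # 6./7. repeat, stopping after 100 steps
    preds = w * x + b                    # 2. use the weights to predict
    loss = ((preds - y) ** 2).mean()     # 3. calculate the loss
    loss.backward()                      # 4. calculate the gradients
    with torch.no_grad():
        w -= lr * w.grad                 # 5. step the weights
        b -= lr * b.grad
    w.grad = None; b.grad = None

print(w.item(), b.item())                # close to 3 and 0.5
```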
Calculating Gradients
Create an example loss function:
def f(x): return x**2Pick a tensor value at which we want gradients:
xt = tensor(3.).requires_grad_()yt = f(xt)
yttensor(9., grad_fn=<PowBackward0>)
Calculate gradients (backpropagation, which happens during the backward pass of the network, as opposed to the forward pass, where the activations are calculated):
yt.backward()View the gradients:
xt.gradtensor(6.)
The derivative of x**2 is 2*x. When x = 3 the derivative is 6, as calculated above.
Calculating vector gradients:
xt = tensor([3., 4., 10.]).requires_grad_()
xttensor([ 3., 4., 10.], requires_grad=True)
Add sum to our function so it takes a vector and returns a scalar:
def f(x): return (x**2).sum()yt = f(xt)
yttensor(125., grad_fn=<SumBackward0>)
yt.backward()
xt.gradtensor([ 6., 8., 20.])
If the gradients are very large, that may suggest that we have more adjustments to do, whereas if they are very small, that may suggest that we are close to the optimal value.
Stepping with a Learning Rate
Deciding how to change our parameters based on the values of the gradients—multiplying the gradient by some small number called the learning rate (LR):
w -= w.grad * lr
This is known as stepping your parameters, using an optimization step.
If you pick a learning rate too low, that can mean having to do a lot of steps. If you pick a learning rate too high, that’s even worse, because it can result in the loss getting worse. If the learning rate is too high it may also “bounce” around.
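This can be seen on a throwaway one-parameter problem (minimizing f(x) = x**2, made up for illustration): a reasonable rate converges, a tiny rate barely moves, and a too-large rate makes the value bounce and diverge.

```python
import torch

def descend(lr, steps=10):
    # take gradient descent steps on f(x) = x**2, starting from x = 2.0
    x = torch.tensor(2.0, requires_grad=True)
    for _ in range(steps):
        loss = x ** 2
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
        x.grad = None
    return x.item()

print(descend(0.1))    # heads toward the minimum at 0
print(descend(0.001))  # too low: barely moves from 2.0
print(descend(1.1))    # too high: bounces across 0 and gets worse each step
```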
An End-to-End SGD Example
Example: measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, get slower as it went up the hill, and speed up again going downhill.
time = torch.arange(0,20).float(); timetensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19.])
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
speedtensor([72.1328, 55.1778, 39.8417, 33.9289, 21.9506, 18.0992, 11.3346, 0.3637,
7.3242, 4.0297, 3.9236, 4.1486, 1.9496, 6.1447, 12.7890, 23.8966,
30.6053, 45.6052, 53.5180, 71.2243])
plt.scatter(time, speed);
We added a bit of random noise since measuring things manually isn’t precise.
What was the roller coaster’s speed? Using SGD, we can try to find a function that matches our observations. Guess that it will be a quadratic of the form a*(time**2) + (b*t) + c.
We want to distinguish clearly between the function’s input (the time when we are measuring the coaster’s speed) and its parameters (the values that define which quadratic we’re trying out).
Collect parameters in one argument and separate t and params in the function’s signature:
def f(t, params):
a,b,c = params
return a*(t**2) + (b*t) + cDefine a loss function:
def mse(preds, targets): return ((preds-targets)**2).mean()Step 1: Initialize the parameters
params = torch.randn(3).requires_grad_()Step 2: Calculate the predictions
preds = f(time, params)Create a little function to see how close our predictions are to our targets:
def show_preds(preds, ax=None):
if ax is None: ax=plt.subplots()[1]
ax.scatter(time, speed)
ax.scatter(time, to_np(preds), color='red')
ax.set_ylim(-300,100)
show_preds(preds)
Step 3: Calculate the loss
loss = mse(preds, speed)
losstensor(11895.1143, grad_fn=<MeanBackward0>)
Step 4: Calculate the gradients
loss.backward()
params.gradtensor([-35554.0117, -2266.8909, -171.8540])
paramstensor([-0.5364, 0.6043, 0.4822], requires_grad=True)
Step 5: Step the weights
lr = 1e-5
params.data -= lr * params.grad.data
params.grad = NoneLet’s see if the loss has improved (it has) and take a look at the plot:
preds = f(time, params)
mse(preds, speed)tensor(2788.1594, grad_fn=<MeanBackward0>)
show_preds(preds)
Step 6: Repeat the process
def apply_step(params, prn=True):
preds = f(time, params)
loss = mse(preds, speed)
loss.backward()
params.data -= lr * params.grad.data
params.grad = None
if prn: print(loss.item())
return predsfor i in range(10): apply_step(params)2788.159423828125
1064.841552734375
738.7333984375
677.02001953125
665.3380737304688
663.1239013671875
662.7010498046875
662.6172485351562
662.59765625
662.5902709960938
_, axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()
Step 7: Stop
We decided to stop after 10 epochs arbitrarily. In practice, we would watch the training and validation losses and our metrics to decide when to stop.
Summarizing Gradient Descent
- At the beginning, the weights of our model can be random (training from scratch) or come from a pretrained model (transfer learning).
- In both cases the model will need to learn better weights.
- Use a loss function to compare model outputs to targets.
- Change the weights to make the loss a bit lower by multiplying the gradients by the learning rate and subtracting the result from the parameters.
- Iterate until you have reached the lowest loss and then stop.
The MNIST Loss Function
Concatenate the images into a single tensor. view changes the shape of a tensor without changing its contents. -1 is a special parameter to view that means “make this axis as big as necessary to fit all the data”.
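A quick standalone check of how view infers the -1 axis (a toy tensor, not the MNIST data):

```python
import torch

t = torch.arange(12)
print(t.view(-1, 4).shape)  # torch.Size([3, 4]): -1 is inferred as 12 / 4
print(t.view(2, -1).shape)  # torch.Size([2, 6])
```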
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)Use the label 1 for 3s and 0 for 7s. Unsqueeze adds a dimension of size one.
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_x.shape, train_y.shape(torch.Size([12396, 784]), torch.Size([12396, 1]))
A PyTorch Dataset is required to return a tuple of (x,y) when indexed.
dset = list(zip(train_x, train_y))
x,y = dset[0]
x.shape,y(torch.Size([784]), tensor([1]))
Prepare the validation dataset:
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x, valid_y))
x,y = valid_dset[0]
x.shape, y(torch.Size([784]), tensor([1]))
Step 1: Initialize the parameters
We need an initially random weight for every pixel.
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()weights = init_params((28*28,1))
weights.shapetorch.Size([784, 1])
\(y = wx + b\).
We created w (the weights); now we need to create b (the intercept, or bias):
bias = init_params(1)
biastensor([-0.0313], requires_grad=True)
Step 2: Calculate the predictions
Prediction for one image
(train_x[0] * weights.T).sum() + biastensor([0.5128], grad_fn=<AddBackward0>)
In Python, matrix multiplication is represented with the @ operator:
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)
predstensor([[ 0.5128],
[-3.8324],
[ 4.9791],
...,
[ 3.0790],
[ 4.1521],
[ 0.3523]], grad_fn=<AddBackward0>)
To decide if an output represents a 3 or a 7, we can just check whether it’s greater than 0:
corrects = (preds>0.0).float() == train_y
correctstensor([[ True],
[False],
[ True],
...,
[False],
[False],
[False]])
corrects.float().mean().item()0.38964182138442993
Step 3: Calculate the loss
A very small change in the value of a weight will often not change the accuracy at all, and thus the gradient is 0 almost everywhere. It’s not useful to use accuracy as a loss function.
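A toy check (made-up predictions and targets): nudging the predictions slightly leaves accuracy completely flat, which is exactly why its gradient is 0 almost everywhere.

```python
import torch

targets = torch.tensor([1., 0., 1.])
preds = torch.tensor([0.5, -0.3, 1.2])

def accuracy(p):
    # the prediction counts as a "3" when the score is greater than 0
    return ((p > 0).float() == targets).float().mean()

print(accuracy(preds))         # tensor(1.)
print(accuracy(preds + 1e-3))  # identical: a tiny change moves nothing
```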
We need a loss function that when our weights result in slightly better predictions, gives us a slightly better loss.
In this case, what does a “slightly better prediction” mean: if the correct answer is a 3 (1), the score is a little higher, or if the correct answer is a 7 (0), the score is a little lower.
The loss function receives not the images themselves, but the predictions from the model.
The loss function will measure how distant each prediction is from 1 (if it should be 1) and how distant it is from 0 (if it should be 0) and then it will take the mean of all those distances.
def mnist_loss(predictions, targets):
return torch.where(targets==1, 1-predictions, predictions).mean()Try it out with sample predictions and targets:
trgts = tensor([1,0,1])
prds = tensor([0.9, 0.4, 0.2])
torch.where(trgts==1, 1-prds, prds)tensor([0.1000, 0.4000, 0.8000])
This function returns a lower number when predictions are more accurate, when accurate predictions are more confident and when inaccurate predictions are less confident.
Since we need a scalar for the final loss, mnist_loss takes the mean of the previous tensor:
mnist_loss(prds, trgts)tensor(0.4333)
mnist_loss assumes that predictions are between 0 and 1. We need to ensure that, using sigmoid, which always outputs a number between 0 and 1:
def sigmoid(x): return 1/(1+torch.exp(-x))plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)
It’s also a smooth curve that only goes up, which makes it easier for SGD to find meaningful gradients. Update mnist_loss to first apply sigmoid to the inputs:
def mnist_loss(predictions, targets):
predictions = predictions.sigmoid()
return torch.where(targets==1, 1-predictions, predictions).mean()We already had a metric, which was overall accuracy. So why did we define a loss?
To drive automated learning, the loss must be a function that has a meaningful derivative. It can’t have big flat sections and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level.
The loss function is calculated for each item in our dataset, and then at the end of an epoch, the loss values are all averaged and the overall mean is reported for the epoch.
It is important that we focus on metrics, rather than the loss, when judging the performance of a model.
SGD and Mini-Batches
The optimization step: change or update the weights based on the gradients.
To take an optimization step, we need to calculate the loss over one or more data items. Calculating the loss for the whole dataset would take a long time, calculating it for a single item would not use much information so it would result in an imprecise and unstable gradient.
Calculate the average loss for a few data items at a time (a mini-batch). The number of data items in the mini-batch is called the batch size.
A larger batch size means you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but it will take longer and you will process fewer mini-batches per epoch. Using batches of data works well for GPUs, but give the GPU too many items at once and it will run out of memory.
We get better generalization if we can vary things during training (like performing data augmentation). One simple and effective thing we can vary is which data items we put in each mini-batch: randomly shuffle the dataset before creating mini-batches. The DataLoader will do the shuffling and mini-batch collation for you:
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)[tensor([10, 3, 8, 11, 0]),
tensor([6, 1, 7, 9, 4]),
tensor([12, 13, 5, 2, 14])]
For training, we want a collection containing independent and dependent variables. A Dataset in PyTorch is a collection containing tuples of independent and dependent variables.
ds = L(enumerate(string.ascii_lowercase))
ds(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
list(enumerate(string.ascii_lowercase))[:5][(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
When we pass a Dataset to a DataLoader we will get back many batches that are themselves tuples of tensors representing batches of independent and dependent variables:
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)[(tensor([24, 2, 4, 8, 9, 13]), ('y', 'c', 'e', 'i', 'j', 'n')),
(tensor([23, 17, 6, 14, 25, 18]), ('x', 'r', 'g', 'o', 'z', 's')),
(tensor([22, 5, 7, 20, 3, 19]), ('w', 'f', 'h', 'u', 'd', 't')),
(tensor([ 0, 21, 12, 1, 16, 10]), ('a', 'v', 'm', 'b', 'q', 'k')),
(tensor([11, 15]), ('l', 'p'))]
Putting It All Together
In code, the process will be implemented something like this for each epoch:
for x,y in dl:
# calculate predictions
pred = model(x)
# calculate the loss
loss = loss_func(pred, y)
# calculate the gradients
loss.backward()
# step the weights
parameters -= parameters.grad * lrStep 1: Initialize the parameters
weights = init_params((28*28, 1))
bias = init_params(1)A DataLoader can be created from a Dataset:
dl = DataLoader(dset, batch_size=256)
xb,yb = first(dl)
xb.shape, yb.shape(torch.Size([256, 784]), torch.Size([256, 1]))
Do the same for the validation set:
valid_dl = DataLoader(valid_dset, batch_size=256)Create a mini-batch of size 4 for testing:
batch = train_x[:4]
batch.shapetorch.Size([4, 784])
preds = linear1(batch)
predstensor([[10.4546],
[ 9.4603],
[-0.2426],
[ 6.7868]], grad_fn=<AddBackward0>)
loss = mnist_loss(preds, train_y[:4])
losstensor(0.1404, grad_fn=<MeanBackward0>)
Step 4: Calculate the gradients
loss.backward()
weights.grad.shape, weights.grad.mean(), bias.grad(torch.Size([784, 1]), tensor(-0.0089), tensor([-0.0619]))
Create a function to calculate gradients:
def calc_grad(xb, yb, model):
preds = model(xb)
loss = mnist_loss(preds, yb)
loss.backward()Test it:
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad(tensor(-0.0178), tensor([-0.1238]))
Look what happens when we call it again:
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad(tensor(-0.0267), tensor([-0.1857]))
The gradients have changed. loss.backward adds the gradients of loss to any gradients that are currently stored. So we have to set the current gradients to 0 first:
weights.grad.zero_()
bias.grad.zero_();Methods in PyTorch whose names end in an underscore modify their objects in place.
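For example (toy tensor), add_ modifies its object in place while add returns a new tensor:

```python
import torch

t = torch.ones(3)
t.add_(1)        # trailing underscore: t itself is modified
print(t)         # tensor([2., 2., 2.])

u = t.add(1)     # no underscore: t is left alone, a new tensor comes back
print(t, u)      # tensor([2., 2., 2.]) tensor([3., 3., 3.])
```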
Step 5: Step the weights
When we update the weights and biases based on the gradient and learning rate, we have to tell PyTorch not to take the gradient of this step. If we assign to the data attribute of a tensor, PyTorch will not take the gradient of that step. Here’s our basic training loop for an epoch:
def train_epoch(model, lr, params):
for xb,yb in dl:
calc_grad(xb, yb, model)
for p in params:
p.data -= p.grad*lr
p.grad.zero_()We want to check how we’re doing by looking at the accuracy of the validation set. To decide if an output represents a 3 (1) or a 7 (0) we can just check whether the prediction is greater than 0.
preds, train_y[:4](tensor([[10.4546],
[ 9.4603],
[-0.2426],
[ 6.7868]], grad_fn=<AddBackward0>),
tensor([[1],
[1],
[1],
[1]]))
(preds>0.0).float() == train_y[:4]tensor([[ True],
[ True],
[False],
[ True]])
# if preds is greater than 0 and the label is 1 -> correct 3 prediction
# if preds is not greater than 0 and the label is 0 -> correct 7 prediction
True == 1, False == 0(True, True)
Create a function to calculate validation accuracy:
def batch_accuracy(xb, yb):
preds = xb.sigmoid()
correct = (preds>0.5) == yb
return correct.float().mean()batch_accuracy(linear1(batch), train_y[:4])tensor(0.7500)
Put the batches back together:
def validate_epoch(model):
accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
return round(torch.stack(accs).mean().item(), 4)Starting point accuracy:
validate_epoch(linear1)0.5703
Let’s train for 1 epoch and see if the accuracy improves:
lr = 1.
params = weights, bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)0.6928
Step 6: Repeat the process
Then do a few more:
for i in range(20):
train_epoch(linear1, lr, params)
print(validate_epoch(linear1), end = ' ')0.852 0.9061 0.931 0.9418 0.9477 0.9569 0.9584 0.9594 0.9599 0.9633 0.9647 0.9652 0.9657 0.9662 0.9672 0.9677 0.9687 0.9696 0.9701 0.9696
We’re already about at the same accuracy as our “pixel similarity” approach.
Creating an Optimizer
Replace our linear function with PyTorch’s nn.Linear module. A module is an object of a class that inherits from the PyTorch nn.Module class, and behaves like a standard Python function in that you can call it using parentheses and it will return the activations of a model.
nn.Linear does the same thing as our init_params and linear together. It contains both weights and biases in a single class:
linear_model = nn.Linear(28*28, 1)Every PyTorch module knows what parameters it has that can be trained; they are available through the parameters method:
w,b = linear_model.parameters()
w.shape, b.shape(torch.Size([1, 784]), torch.Size([1]))
We can use this information to create an optimizer:
class BasicOptim:
def __init__(self,params,lr): self.params,self.lr = list(params),lr
def step(self, *args, **kwargs):
for p in self.params: p.data -= p.grad.data * self.lr
def zero_grad(self, *args, **kwargs):
for p in self.params: p.grad = NoneWe can create our optimizer by passing in the model’s parameters:
opt = BasicOptim(linear_model.parameters(), lr)Simplify our training loop:
def train_epoch(model):
for xb,yb in dl:
# calculate the gradients
calc_grad(xb,yb,model)
# step the weights
opt.step()
opt.zero_grad()Our validation function doesn’t need to change at all:
validate_epoch(linear_model)0.3985
Put our training loop in a function:
def train_model(model, epochs):
for i in range(epochs):
train_epoch(model)
print(validate_epoch(model), end=' ')Similar results as the previous training:
train_model(linear_model, 20)0.4932 0.7959 0.8506 0.9136 0.9341 0.9492 0.9556 0.9629 0.9658 0.9683 0.9702 0.9717 0.9741 0.9746 0.9761 0.9766 0.9775 0.978 0.9785 0.979
fastai provides the SGD class that by default does the same thing as our BasicOptim:
linear_model = nn.Linear(28*28, 1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)0.4932 0.8735 0.8174 0.9082 0.9331 0.9468 0.9546 0.9614 0.9653 0.9668 0.9692 0.9727 0.9736 0.9751 0.9756 0.9761 0.9775 0.978 0.978 0.9785
fastai provides Learner.fit which we can use instead of train_model. To create a Learner we first need to create a DataLoaders, by passing our training and validation DataLoaders:
dls = DataLoaders(dl, valid_dl)To create a Learner without using an application such as cnn_learner we need to pass in all the elements that we’ve created in this chapter: the DataLoaders, the model, the optimization function (which will be passed the parameters), the loss function, and optionally any metrics to print:
learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)learn.fit(10, lr=lr)| epoch | train_loss | valid_loss | batch_accuracy | time |
|---|---|---|---|---|
| 0 | 0.636474 | 0.503518 | 0.495584 | 00:00 |
| 1 | 0.550751 | 0.189374 | 0.840530 | 00:00 |
| 2 | 0.201501 | 0.178350 | 0.839549 | 00:00 |
| 3 | 0.087588 | 0.105257 | 0.912659 | 00:00 |
| 4 | 0.045719 | 0.076968 | 0.933759 | 00:00 |
| 5 | 0.029454 | 0.061683 | 0.947498 | 00:00 |
| 6 | 0.022817 | 0.052156 | 0.954367 | 00:00 |
| 7 | 0.019893 | 0.045825 | 0.962709 | 00:00 |
| 8 | 0.018424 | 0.041383 | 0.965653 | 00:00 |
| 9 | 0.017549 | 0.038113 | 0.967125 | 00:00 |
Adding a Nonlinearity
Adding a nonlinearity between two linear classifiers gives us a neural network.
def simple_net(xb):
res = xb@w1 + b1
res = res.max(tensor(0.0))
res = res@w2 + b2
return res# initialize weights
w1 = init_params((28*28, 30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)w1 has 30 output activations which means w2 must have 30 input activations so that they match. 30 output activations means that the first layer can construct 30 different features, each representing a different mix of pixels. You can change that 30 to anything you like to make the model more or less complex.
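The shape bookkeeping can be checked with throwaway random tensors (a pretend mini-batch of 5 flattened images):

```python
import torch

xb = torch.randn(5, 28 * 28)               # pretend mini-batch of 5 images
w1, b1 = torch.randn(28 * 28, 30), torch.randn(30)
w2, b2 = torch.randn(30, 1), torch.randn(1)

h = (xb @ w1 + b1).max(torch.tensor(0.0))  # ReLU; shape stays (5, 30)
out = h @ w2 + b2                          # 30 outputs feed 30 inputs
print(h.shape, out.shape)                  # torch.Size([5, 30]) torch.Size([5, 1])
```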
res.max(tensor(0.0)) is called a rectified linear unit or ReLU. It replaces every negative number with a zero.
plot_function(F.relu)
We need a nonlinearity because a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters.
The neural net can solve any computable problem to an arbitrarily high level of accuracy if you can find the right parameters w1 and w2 and if you make the matrices big enough.
We can replace our function with PyTorch:
simple_net = nn.Sequential(
nn.Linear(28*28, 30),
nn.ReLU(),
nn.Linear(30,1)
)nn.Sequential creates a module that will call each of the listed layers or functions in turn. When using nn.Sequential, PyTorch requires us to use the module version (nn.ReLU) and not the function version (F.relu). Modules are classes, so you have to instantiate them.
learn = Learner(dls, simple_net, opt_func=SGD,
loss_func=mnist_loss, metrics=batch_accuracy)learn.fit(40, 0.1)| epoch | train_loss | valid_loss | batch_accuracy | time |
|---|---|---|---|---|
| 0 | 0.363529 | 0.409795 | 0.505888 | 00:00 |
| 1 | 0.165949 | 0.239534 | 0.792934 | 00:00 |
| 2 | 0.089140 | 0.117148 | 0.913150 | 00:00 |
| 3 | 0.056798 | 0.078107 | 0.941119 | 00:00 |
| 4 | 0.042071 | 0.060734 | 0.957311 | 00:00 |
| 5 | 0.034718 | 0.051121 | 0.962218 | 00:00 |
| 6 | 0.030605 | 0.045103 | 0.964181 | 00:00 |
| 7 | 0.027994 | 0.040995 | 0.966143 | 00:00 |
| 8 | 0.026145 | 0.037990 | 0.969087 | 00:00 |
| 9 | 0.024728 | 0.035686 | 0.970559 | 00:00 |
| 10 | 0.023585 | 0.033853 | 0.972522 | 00:00 |
| 11 | 0.022634 | 0.032346 | 0.973994 | 00:00 |
| 12 | 0.021826 | 0.031080 | 0.975466 | 00:00 |
| 13 | 0.021127 | 0.029996 | 0.976448 | 00:00 |
| 14 | 0.020514 | 0.029053 | 0.975957 | 00:00 |
| 15 | 0.019972 | 0.028221 | 0.976448 | 00:00 |
| 16 | 0.019488 | 0.027481 | 0.977920 | 00:00 |
| 17 | 0.019051 | 0.026818 | 0.978410 | 00:00 |
| 18 | 0.018654 | 0.026219 | 0.978410 | 00:00 |
| 19 | 0.018291 | 0.025677 | 0.978901 | 00:00 |
| 20 | 0.017958 | 0.025181 | 0.978901 | 00:00 |
| 21 | 0.017650 | 0.024727 | 0.980373 | 00:00 |
| 22 | 0.017363 | 0.024310 | 0.980864 | 00:00 |
| 23 | 0.017096 | 0.023925 | 0.980864 | 00:00 |
| 24 | 0.016846 | 0.023570 | 0.981845 | 00:00 |
| 25 | 0.016610 | 0.023241 | 0.982336 | 00:00 |
| 26 | 0.016389 | 0.022935 | 0.982336 | 00:00 |
| 27 | 0.016179 | 0.022652 | 0.982826 | 00:00 |
| 28 | 0.015980 | 0.022388 | 0.982826 | 00:00 |
| 29 | 0.015791 | 0.022142 | 0.982826 | 00:00 |
| 30 | 0.015611 | 0.021913 | 0.983317 | 00:00 |
| 31 | 0.015440 | 0.021700 | 0.983317 | 00:00 |
| 32 | 0.015276 | 0.021500 | 0.983317 | 00:00 |
| 33 | 0.015120 | 0.021313 | 0.983317 | 00:00 |
| 34 | 0.014969 | 0.021137 | 0.983317 | 00:00 |
| 35 | 0.014825 | 0.020972 | 0.983317 | 00:00 |
| 36 | 0.014686 | 0.020817 | 0.982826 | 00:00 |
| 37 | 0.014553 | 0.020671 | 0.982826 | 00:00 |
| 38 | 0.014424 | 0.020532 | 0.982826 | 00:00 |
| 39 | 0.014300 | 0.020401 | 0.982826 | 00:00 |
You can view the training process in learn.recorder:
plt.plot(L(learn.recorder.values).itemgot(2))
View the final accuracy:
learn.recorder.values[-1][2]0.982826292514801
At this point we have:
- A function that can solve any problem to any level of accuracy (the neural network) given the correct set of parameters.
- A way to find the best set of parameters for any function (stochastic gradient descent).
Going Deeper
We can add as many layers in our neural network as we want, as long as we add a nonlinearity between each pair of linear layers.
The deeper the model gets, the harder it is to optimize the parameters.
With a deeper model (one with more layers) we do not need to use as many parameters. We can use smaller matrices with more layers and get better results than we would get with larger matrices and few layers.
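A rough illustration of that trade-off (the layer widths 100 and 50 are made up for comparison): a deeper, narrower stack can have far fewer parameters than a single wide hidden layer.

```python
from torch import nn

def n_params(model):
    # total number of trainable values across all layers
    return sum(p.numel() for p in model.parameters())

wide = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 1))
deep = nn.Sequential(nn.Linear(784, 50), nn.ReLU(),
                     nn.Linear(50, 50), nn.ReLU(), nn.Linear(50, 1))

print(n_params(wide), n_params(deep))  # 78601 41851
```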
In the 1990s what held back the field for years was that so few researchers were experimenting with more than one nonlinearity.
Training an 18-layer model:
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)/usr/local/lib/python3.10/dist-packages/fastai/vision/learner.py:288: UserWarning: `cnn_learner` has been renamed to `vision_learner` -- please update your code
warn("`cnn_learner` has been renamed to `vision_learner` -- please update your code")
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.098852 | 0.014919 | 0.996075 | 02:01 |
Jargon Recap
Activations: Numbers that are calculated (both by linear and nonlinear layers)
Parameters: Numbers that are randomly initialized and optimized (that is, the numbers that define the model).
Part of becoming a good deep learning practitioner is getting used to the idea of looking at your activations and parameters, and plotting them and testing whether they are behaving correctly.
Activations and parameters are all contained in tensors. The number of dimensions of a tensor is its rank.
A neural network contains a number of layers. Each layer is either linear or nonlinear. We generally alternate between these two kinds of layers in a neural network. Sometimes a nonlinearity is referred to as an activation function.
Key concepts related to SGD:
| Term | Meaning |
|---|---|
| ReLU | Function that returns 0 for negative numbers and doesn’t change positive numbers. |
| Mini-batch | A small group of inputs and labels gathered together in two arrays. A gradient descent step is computed on this batch (rather than on a whole epoch). |
| Forward pass | Applying the model to some input and computing the predictions. |
| Loss | A value that represents how well or badly our model is doing. |
| Gradient | The derivative of the loss with respect to some parameter of the model. |
| Backward pass | Computing the gradients of the loss with respect to all model parameters. |
| Gradient descent | Taking a step in the direction opposite to the gradients to make the model parameters a little bit better. |
| Learning rate | The size of the step we take when applying SGD to update the parameters of the model. |
Questionnaire
1. How is a grayscale image represented on a computer? How about a color image?
Grayscale image pixels can be 0 (black) to 255 (white). Color image pixels have three values (Red, Green, Blue) where each value can be from 0 to 255.
2. How are the files and folders in the MNIST_SAMPLE dataset structured? Why?
path.ls()(#3) [Path('/root/.fastai/data/mnist_sample/labels.csv'),Path('/root/.fastai/data/mnist_sample/train'),Path('/root/.fastai/data/mnist_sample/valid')]
MNIST_SAMPLE path has a labels.csv file, a train folder, and a valid folder.
(path/'train').ls()(#2) [Path('/root/.fastai/data/mnist_sample/train/3'),Path('/root/.fastai/data/mnist_sample/train/7')]
The train folder has a 3 and a 7 folder, each which contains training images.
(path/'valid').ls()(#2) [Path('/root/.fastai/data/mnist_sample/valid/3'),Path('/root/.fastai/data/mnist_sample/valid/7')]
The valid folder contains a 3 and a 7 folder, each containing validation set images.
3. Explain how the “pixel similarity” approach to classifying digits works.
Pixel similarity works by computing the mean of all training images of each digit (the “ideal” 3 and the “ideal” 7), then classifying a validation image as a 3 if its mean absolute difference (L1 norm) from the ideal 3 is smaller than its distance from the ideal 7, and as a 7 otherwise. The model’s accuracy is the proportion of each digit’s validation images classified correctly.
4. What is list comprehension? Create one now that selects odd numbers from a list and doubles them.
List comprehension is syntax for creating a new list based on another sequence or iterable (docs)
# for each element in range(10)
# if the modulo of the element and 2 is not 0
# double the element's value and store in this new list
doubled_odds = [2*elem for elem in range(10) if elem % 2 != 0]
doubled_odds[2, 6, 10, 14, 18]
5. What is a rank-3 tensor?
A rank-3 tensor is a “cube” (3-dimensional tensor).
6. What is the difference between tensor rank and shape? How do you get the rank from the shape?
Tensor rank is the number of dimensions of the tensor. Tensor shape is the number of elements in each dimension. The rank is the length of the shape (len(t.shape)). The following tensor has rank 2, and its shape is 3 elements by 2 elements.
a_tensor = tensor([[1,3], [4,5], [5,6]])
# dim == rank
a_tensor.dim(), a_tensor.shape
(2, torch.Size([3, 2]))
7. What are RMSE and L1 norm?
RMSE = Root Mean Squared Error: The square root of the mean of squared differences between two sets of values.
L1 norm = mean absolute difference: the mean of the absolute value of differences between two sets of values.
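A quick worked example of both with made-up numbers:

```python
import math

# Illustrative predictions and targets (not from the lesson)
preds   = [2.0, 4.0, 6.0]
targets = [1.0, 4.0, 8.0]
diffs   = [p - t for p, t in zip(preds, targets)]   # [1.0, 0.0, -2.0]

# L1 norm: mean of absolute differences -> (1 + 0 + 2) / 3 = 1.0
l1 = sum(abs(d) for d in diffs) / len(diffs)

# RMSE: square root of the mean of squared differences -> sqrt(5/3) ~= 1.29
rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

print(l1, rmse)
```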
8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
You can do so by performing the calculation on a whole array or tensor at once (vectorization), ideally on a GPU, instead of looping over elements in Python.
9. Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom four numbers.
a_tensor = tensor([[1,2,3], [4,5,6], [7,8,9]])
a_tensor
tensor([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
a_tensor = 2 * a_tensor
a_tensor
tensor([[ 2,  4,  6],
[ 8, 10, 12],
[14, 16, 18]])
a_tensor.view(-1, 9)[0,-4:]
tensor([12, 14, 16, 18])
10. What is broadcasting?
Broadcasting is when a tensor of smaller rank (or a scalar) is expanded so that you can perform an operation between it and a tensor of larger rank. Broadcasting makes it so that the two operands have compatible shapes.
a_tensor + tensor([1,2,3])
tensor([[ 3,  6,  9],
[ 9, 12, 15],
[15, 18, 21]])
11. Are metrics generally calculated using the training set or the validation set? Why?
Metrics are calculated on the validation set because that is data the model does not see during training, so the metric tells you how your model performs on data it hasn’t seen before.
12. What is SGD?
SGD is Stochastic Gradient Descent, an automated process by which a model learns the parameters needed to solve problems like image classification. The parameters, initialized randomly (training from scratch) or taken from a pretrained model (transfer learning), are updated using their gradients with respect to the loss, scaled by the learning rate. Metrics like accuracy measure how well the model is performing.
13. Why does SGD use mini-batches?
One reason is to utilize the ability of a GPU to process a lot of data at once.
Another reason is that calculating the loss one image at a time leads to an unstable loss function whereas calculating the loss on the entire dataset takes too long. Mini-batches fall in between these two extremes.
14. What are the seven steps in SGD for machine learning?
- Initialize the weights.
- Calculate the predictions.
- Calculate the loss.
- Calculate gradients.
- Step the weights.
- Repeat the process.
- Stop.
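The seven steps above can be sketched end to end in a few lines of PyTorch. The data and values here are illustrative (fitting the weight w in y = w*x, where the true w is 2), not from the lesson:

```python
import torch

# Toy data: targets follow y = 2 * x, so the ideal weight is 2.
x = torch.tensor([1., 2., 3., 4.])
y = 2 * x

w = torch.tensor([0.], requires_grad=True)   # 1. initialize the weights
lr = 0.01

for _ in range(100):                         # 6. repeat the process
    preds = w * x                            # 2. calculate the predictions
    loss = ((preds - y) ** 2).mean()         # 3. calculate the loss
    loss.backward()                          # 4. calculate gradients
    with torch.no_grad():
        w -= w.grad * lr                     # 5. step the weights
        w.grad.zero_()                       #    reset gradients for next pass
# 7. stop (here: after a fixed number of iterations)

print(w.item())  # close to 2
```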
15. How do we initialize the weights in a model?
Either randomly (if training from scratch) or using pretrained weights (if transfer learning from an existing model like resnet18).
16. What is loss?
A machine-friendly way to measure how well (or badly) the model is performing. The model is learning to step the weights in order to decrease the loss.
17. Why can’t we always use a high learning rate?
Because we risk overshooting the minimum loss (bouncing back and forth between the two sides of the loss curve) or diverging (resulting in larger losses at each step).
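A tiny numeric sketch of this on f(x) = x², whose gradient is 2x (illustrative learning rates only):

```python
# Gradient descent on f(x) = x^2, whose minimum is at x = 0.
def descend(x, lr, steps=20):
    for _ in range(steps):
        x -= lr * 2 * x   # step against the gradient f'(x) = 2x
    return x

print(descend(1.0, lr=0.1))   # shrinks toward 0 each step (converges)
print(descend(1.0, lr=1.1))   # overshoots further each step (diverges)
```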
18. What is a gradient?
The rate of change or derivative of one variable with respect to another variable. In our case, gradients are the ratio of change in loss to change in parameter at one point.
19. Do you need to know how to calculate gradients yourself?
Nope! Although you should understand the basic concept of derivatives. PyTorch calculates gradients with the .backward method.
20. Why can’t we use accuracy as a loss function?
Because small changes in predictions do not result in small changes in accuracy. Accuracy drastically jumps (from 0 to 1 in our MNIST_SAMPLE example) at one point, with 0 slope elsewhere. We want a smooth function where you can calculate non-zero and non-infinite derivatives everywhere.
21. Draw the sigmoid function. What is special about its shape?
The sigmoid function outputs between 0 and 1 for input values going from -inf to +inf. It also has a smooth positive slope everywhere, so it’s easy to take the derivative.
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)
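A minimal sigmoid implementation (a sketch; in practice you’d use torch.sigmoid):

```python
import math

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1)
    return 1 / (1 + math.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # near 0, exactly 0.5, near 1
```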
22. What is the difference between a loss function and a metric?
The loss function is a machine-friendly way to measure the performance of the model while a metric is a human-friendly way to do the same.
The purpose of the loss function is to provide a smooth function to take derivatives over so the training system can change the weights little by little towards the optimum.
The purpose of the metric is to inform the human how well or badly the model is learning during training.
23. What is the function to calculate new weights using a learning rate?
In code, the function is:
parameters.data -= parameters.grad * lr
The new weights are stepped incrementally in the opposite direction of the gradients. If the gradient is negative, the weights will be increased. If the gradient is positive, the weights will be decreased.
24. What does the DataLoader class do?
The DataLoader class prepares training and validation batches and feeds them to the GPU during training. It also performs any necessary item_tfms or batch_tfms to the data.
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
def train_epoch(model):
# calculate predictions
preds = model(xb)
# calculate the loss
loss = loss_func(preds, targets)
# calculate gradients
loss.backward()
# step the weights
params.data -= params.grad * lr
# reset the gradients
params.zero_grad_()
# calculate accuracy
    acc = tensor([accuracy for each batch]).mean()
26. Create a function that, if passed two arguments [1, 2, 3, 4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def zipped_tuples(x, y): return list(zip(x,y))
zipped_tuples([1,2,3,4], 'abcd')
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
The output data structure is the same structure as the PyTorch Dataset.
27. What does view do in PyTorch?
view changes the rank and shape of the tensor.
tensor([[1,2,3],[4,5,6]]).view(3,2)
tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1,2,3],[4,5,6]]).view(6)
tensor([1, 2, 3, 4, 5, 6])
28. What are the bias parameters in a neural network? Why do we need them?
The bias parameters are the intercept \(b\) in the function \(y = wx + b\). We need them for situations where the inputs are 0 (since \(w*0 = 0\)). Bias also helps to create a more flexible function (source).
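A tiny illustration (made-up numbers): without a bias, a linear function is pinned to 0 whenever the input is 0:

```python
def linear_no_bias(x, w):
    return w * x

def linear_with_bias(x, w, b):
    return w * x + b

print(linear_no_bias(0, w=3.0))           # 0.0 -- always zero at x = 0
print(linear_with_bias(0, w=3.0, b=1.5))  # 1.5 -- the bias shifts the line
```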
29. What does the @ operator do in Python?
Matrix multiplication.
v1 = tensor([1,2,3])
v2 = tensor([4,5,6])
v1 @ v2
tensor(32)
30. What does the backward method do?
Calculate the gradients of the loss function with respect to the parameters.
31. Why do we have to zero the gradients?
Each time you call .backward PyTorch will add the new gradients to the current gradients, so we need to zero the gradients to prevent them from accumulating.
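A small demonstration of that accumulation (toy values):

```python
import torch

x = torch.tensor([1.], requires_grad=True)

(3 * x).sum().backward()
first = x.grad.item()    # 3.0

# Calling backward again ADDS the new gradient to the stored one...
(3 * x).sum().backward()
second = x.grad.item()   # 6.0 -- accumulated, not replaced

# ...so we zero the gradient before the next training step.
x.grad.zero_()
print(first, second, x.grad.item())
```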
32. What information do we have to pass to Learner?
Reference:
Learner(dls, simple_net, opt_func=SGD,
loss_func=mnist_loss, metrics=batch_accuracy)
We pass to the Learner:
- The DataLoaders containing training and validation sets.
- The model we want to train.
- An optimizer function.
- A loss function.
- Any metrics we want calculated.
33. Show Python or pseudocode for the basic steps of a training loop.
See #25.
34. What is ReLU? Draw a plot for it for values from -2 to +2.
ReLU is Rectified Linear Unit. It’s a function where if the inputs are negative, they are set to zero, and if the inputs are positive, they are kept as is.
plot_function(F.relu, min=-2, max=2)
35. What is an activation function?
An activation function is the function that produces our predictions (in our case, a neural net with linear and nonlinear layers). Sometimes the ReLU is referred to as the activation function.
36. What’s the difference between F.relu and nn.ReLU?
F.relu is a function whereas nn.ReLU is a class that needs to be instantiated.
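Both compute the same thing; nn.ReLU is convenient when building models with nn.Sequential. A quick check (toy tensor):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

t = torch.tensor([-1.0, 0.0, 2.0])

out_f = F.relu(t)        # plain function call
out_m = nn.ReLU()(t)     # class: instantiate the module, then call it

print(out_f, out_m)      # both tensor([0., 0., 2.])
```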
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
In practice, deeper models perform better: with more layers we can use smaller matrices and get better results with less compute and memory than a single enormous layer would need, and they are easier to train.